Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning in editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing on educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.
Performance scores on KRIS-Bench are presented across three knowledge types: Factual, Conceptual, and Procedural, along with their corresponding reasoning dimensions.
Columns are grouped by knowledge type: Factual (Attribute Perception, Spatial Perception, Temporal Perception), Conceptual (Social Science, Natural Science), and Procedural (Logical Reasoning, Instruction Decomposition), each with its average, followed by the overall score.

| # | Model | Attribute Perception | Spatial Perception | Temporal Perception | Factual Avg. | Social Science | Natural Science | Conceptual Avg. | Logical Reasoning | Instruction Decomposition | Procedural Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o (OpenAI) | 83.17 | 79.08 | 68.25 | 79.80 | 85.50 | 80.06 | 81.37 | 71.56 | 85.08 | 78.32 | 80.09 |
| 2 | Gemini 2.0 | 66.33 | 63.33 | 63.92 | 65.26 | 68.19 | 56.94 | 59.65 | 54.13 | 71.67 | 62.90 | 62.41 |
| 3 | Doubao (ByteDance) | 70.92 | 59.17 | 40.58 | 63.30 | 65.50 | 61.19 | 62.23 | 47.75 | 60.58 | 54.17 | 60.70 |
| 4 | BAGEL-Think (ByteDance Seed) | 66.42 | 67.75 | 0.00 | 55.77 | 59.63 | 59.38 | 59.44 | 51.19 | 27.33 | 39.26 | 53.36 |
| 5 | UniWorld-V1 (PKU) | 58.17 | 54.50 | 63.00 | 47.71 | 47.50 | 43.94 | 44.80 | 42.00 | 53.83 | 47.92 | 50.27 |
| 6 | BAGEL (ByteDance Seed) | 58.08 | 54.50 | 0.00 | 47.71 | 52.69 | 52.00 | 52.17 | 49.63 | 30.83 | 40.23 | 47.76 |
| 7 | Step1X-Edit (StepFun) | 55.50 | 51.75 | 0.00 | 45.52 | 44.69 | 49.06 | 48.01 | 40.88 | 22.75 | 31.82 | 43.29 |
| 8 | Emu2 (BAAI) | 51.50 | 48.83 | 22.17 | 45.40 | 34.69 | 38.44 | 37.54 | 24.81 | 45.00 | 34.91 | 39.70 |
| 9 | AnyEdit (ZJU) | 47.67 | 45.17 | 0.00 | 39.26 | 38.56 | 42.94 | 41.88 | 36.56 | 26.92 | 31.74 | 38.55 |
| 10 | MagicBrush (OSU) | 53.92 | 39.58 | 0.00 | 41.84 | 42.94 | 38.06 | 39.24 | 30.00 | 23.08 | 26.54 | 37.15 |
| 11 | OmniGen (BAAI) | 37.92 | 28.25 | 21.83 | 33.11 | 30.63 | 27.19 | 28.02 | 11.94 | 35.83 | 23.89 | 28.85 |
| 12 | InstructPix2Pix (UCB) | 30.33 | 21.33 | 0.00 | 23.33 | 22.56 | 26.56 | 25.59 | 19.81 | 14.75 | 17.28 | 22.82 |
Note: To submit your results, please contact us at yongliang0223@gmail.com.
```bibtex
@article{wu2025kris,
  title={KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models},
  author={Wu, Yongliang and Li, Zonghui and Hu, Xinting and Ye, Xinyu and Zeng, Xianfang and Yu, Gang and Zhu, Wenbo and Schiele, Bernt and Yang, Ming-Hsuan and Yang, Xu},
  journal={arXiv preprint arXiv:2505.16707},
  year={2025}
}
```