KRIS-Bench: Benchmarking Next-Level
Intelligent Image Editing Models

Yongliang Wu1,4, Zonghui Li1, Xinting Hu2, Xinyu Ye3, Xianfang Zeng4, Gang Yu4, Wenbo Zhu5, Bernt Schiele2, Ming-Hsuan Yang6, Xu Yang1
1Southeast University, 2Max Planck Institute for Informatics, 3Shanghai Jiao Tong University, 4StepFun, 5University of California, Berkeley, 6University of California, Merced
Corresponding author, Project leader

Abstract

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for editing tasks that require knowledge-based reasoning remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing on educational theory, KRIS-Bench categorizes editing tasks into three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on nine state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.
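
To make the taxonomy concrete, the sketch below shows one way a KRIS-Bench-style editing instance and its evaluation record could be represented. The field names, score scale, and extra-metric placeholder are illustrative assumptions, not the released data format; only the knowledge types, the 22-task/7-dimension structure, the knowledge hints, and the Knowledge Plausibility metric come from the abstract above.

```python
# Illustrative sketch (not the released format): one way to represent a
# KRIS-Bench-style editing instance and its evaluation record.
from dataclasses import dataclass, field
from typing import Optional

KNOWLEDGE_TYPES = ("Factual", "Conceptual", "Procedural")

@dataclass
class EditingInstance:
    instance_id: str
    knowledge_type: str            # one of KNOWLEDGE_TYPES
    reasoning_dimension: str       # e.g. "Temporal Perception" or "Natural Science"
    task: str                      # one of the 22 representative tasks
    source_image: str              # path or URL of the input image
    instruction: str               # natural-language editing instruction
    knowledge_hint: Optional[str] = None  # hint supporting the Knowledge Plausibility metric

@dataclass
class EvaluationRecord:
    instance_id: str
    knowledge_plausibility: float  # metric named in the abstract; 0-100 scale assumed
    other_metrics: dict = field(default_factory=dict)  # placeholder for additional criteria
```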

KRIS-Bench Overview

Leaderboard

Performance scores on KRIS-Bench are presented across three knowledge types: Factual, Conceptual, and Procedural, along with their corresponding reasoning dimensions.

| # | Model | Organization | Attribute Perception | Spatial Perception | Temporal Perception | Factual Avg. | Social Science | Natural Science | Conceptual Avg. | Logical Reasoning | Instruction Decomposition | Procedural Avg. | Overall |
|---|-------|--------------|----------------------|--------------------|---------------------|--------------|----------------|-----------------|-----------------|-------------------|---------------------------|-----------------|---------|
| 1 | GPT-4o | OpenAI | 83.17 | 79.08 | 68.25 | 79.80 | 85.50 | 80.06 | 81.37 | 71.56 | 85.08 | 78.32 | 80.09 |
| 2 | Gemini 2.0 | Google | 66.33 | 63.33 | 63.92 | 65.26 | 68.19 | 56.94 | 59.65 | 54.13 | 71.67 | 62.90 | 62.41 |
| 3 | Doubao | ByteDance Doubao | 70.92 | 59.17 | 40.58 | 63.30 | 65.50 | 61.19 | 62.23 | 47.75 | 60.58 | 54.17 | 60.70 |
| 4 | BAGEL-Think | ByteDance Seed | 66.42 | 67.75 | 0.00 | 55.77 | 59.63 | 59.38 | 59.44 | 51.19 | 27.33 | 39.26 | 53.36 |
| 5 | BAGEL | ByteDance Seed | 58.08 | 54.50 | 0.00 | 47.71 | 52.69 | 52.00 | 52.17 | 49.63 | 30.83 | 40.23 | 47.76 |
| 6 | Step1X-Edit | StepFun | 55.50 | 51.75 | 0.00 | 45.52 | 44.69 | 49.06 | 48.01 | 40.88 | 22.75 | 31.82 | 43.29 |
| 7 | Emu2 | BAAI | 51.50 | 48.83 | 22.17 | 45.40 | 34.69 | 38.44 | 37.54 | 24.81 | 45.00 | 34.91 | 39.70 |
| 8 | AnyEdit | ZJU | 47.67 | 45.17 | 0.00 | 39.26 | 38.56 | 42.94 | 41.88 | 36.56 | 26.92 | 31.74 | 38.55 |
| 9 | MagicBrush | OSU | 53.92 | 39.58 | 0.00 | 41.84 | 42.94 | 38.06 | 39.24 | 30.00 | 23.08 | 26.54 | 37.15 |
| 10 | OmniGen | BAAI | 37.92 | 28.25 | 21.83 | 33.11 | 30.63 | 27.19 | 28.02 | 11.94 | 35.83 | 23.89 | 28.85 |
| 11 | InstructPix2Pix | UCB | 30.33 | 21.33 | 0.00 | 23.33 | 22.56 | 26.56 | 25.59 | 19.81 | 14.75 | 17.28 | 22.82 |
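
Note that the Average and Overall columns are not unweighted means of the displayed dimension scores, which suggests some weighting, presumably by the number of tasks or instances per dimension. The minimal sketch below illustrates an instance-weighted aggregation from per-instance scores under that assumption; the record schema and weighting scheme are hypothetical, not the official scoring script.

```python
# Minimal sketch of instance-weighted aggregation from per-instance scores.
# Assumption: dimension, knowledge-type, and overall scores are means over
# instances, so dimensions with more instances weigh more in the averages.
from collections import defaultdict
from statistics import mean

def aggregate(records):
    """records: iterable of dicts with keys 'knowledge_type', 'dimension', 'score'."""
    by_dimension = defaultdict(list)
    by_type = defaultdict(list)
    all_scores = []
    for r in records:
        by_dimension[r["dimension"]].append(r["score"])
        by_type[r["knowledge_type"]].append(r["score"])
        all_scores.append(r["score"])
    return {
        "dimension_scores": {d: mean(s) for d, s in by_dimension.items()},
        "type_averages": {t: mean(s) for t, s in by_type.items()},  # instance-weighted
        "overall": mean(all_scores),
    }

# Example: two Factual instances and one Conceptual instance.
print(aggregate([
    {"knowledge_type": "Factual", "dimension": "Attribute Perception", "score": 83.0},
    {"knowledge_type": "Factual", "dimension": "Spatial Perception", "score": 79.0},
    {"knowledge_type": "Conceptual", "dimension": "Natural Science", "score": 80.0},
]))
```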

Task-Level Performance

Leaderboard Comparison

Note: If you would like to submit your results, please contact us at yongliang0223@gmail.com.

Benchmark Examples

KRIS-Bench Method

BibTeX

BibTeX code here