KRIS-Bench: Benchmarking Next-Level
Intelligent Image Editing Models

Yongliang Wu1,4, Zonghui Li1, Xinting Hu2, Xinyu Ye3, Xianfang Zeng4, Gang Yu4, Wenbo Zhu5, Bernt Schiele2, Ming-Hsuan Yang6, Xu Yang1
1Southeast University, 2Max Planck Institute for Informatics, 3Shanghai Jiao Tong University, 4StepFun, 5University of California, Berkeley, 6University of California, Merced
Corresponding author, Project leader

Abstract

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for editing tasks that require knowledge-based reasoning remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing on educational theory, KRIS-Bench categorizes editing tasks into three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on nine state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.
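
To make the taxonomy concrete, the sketch below shows one way a KRIS-Bench-style editing instance and its evaluation record could be represented. The field names, score scale, and extra-metric placeholder are illustrative assumptions, not the released data format; only the knowledge types, the 22-task/7-dimension structure, the knowledge hints, and the Knowledge Plausibility metric come from the abstract above.

```python
# Illustrative sketch (not the released format): one way to represent a
# KRIS-Bench-style editing instance and its evaluation record.
from dataclasses import dataclass, field
from typing import Optional

KNOWLEDGE_TYPES = ("Factual", "Conceptual", "Procedural")

@dataclass
class EditingInstance:
    instance_id: str
    knowledge_type: str            # one of KNOWLEDGE_TYPES
    reasoning_dimension: str       # e.g. "Temporal Perception" or "Natural Science"
    task: str                      # one of the 22 representative tasks
    source_image: str              # path or URL of the input image
    instruction: str               # natural-language editing instruction
    knowledge_hint: Optional[str] = None  # hint supporting the Knowledge Plausibility metric

@dataclass
class EvaluationRecord:
    instance_id: str
    knowledge_plausibility: float  # metric named in the abstract; 0-100 scale assumed
    other_metrics: dict = field(default_factory=dict)  # placeholder for additional criteria
```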

KRIS-Bench Overview

Leaderboard

Performance scores on KRIS-Bench are presented across three knowledge types: Factual, Conceptual, and Procedural, along with their corresponding reasoning dimensions.

| # | Model | Organization | Attribute Perception | Spatial Perception | Temporal Perception | Factual Avg. | Social Science | Natural Science | Conceptual Avg. | Logical Reasoning | Instruction Decomposition | Procedural Avg. | Overall |
|---|-------|--------------|----------------------|--------------------|---------------------|--------------|----------------|-----------------|-----------------|-------------------|---------------------------|-----------------|---------|
| 1 | GPT-4o | OpenAI | 83.17 | 79.08 | 68.25 | 79.80 | 85.50 | 80.06 | 81.37 | 71.56 | 85.08 | 78.32 | 80.09 |
| 2 | Gemini 2.0 | Google | 66.33 | 63.33 | 63.92 | 65.26 | 68.19 | 56.94 | 59.65 | 54.13 | 71.67 | 62.90 | 62.41 |
| 3 | Doubao | ByteDance Doubao | 70.92 | 59.17 | 40.58 | 63.30 | 65.50 | 61.19 | 62.23 | 47.75 | 60.58 | 54.17 | 60.70 |
| 4 | BAGEL-Think | ByteDance Seed | 66.42 | 67.75 | 0.00 | 55.77 | 59.63 | 59.38 | 59.44 | 51.19 | 27.33 | 39.26 | 53.36 |
| 5 | BAGEL | ByteDance Seed | 58.08 | 54.50 | 0.00 | 47.71 | 52.69 | 52.00 | 52.17 | 49.63 | 30.83 | 40.23 | 47.76 |
| 6 | Step1X-Edit | StepFun | 55.50 | 51.75 | 0.00 | 45.52 | 44.69 | 49.06 | 48.01 | 40.88 | 22.75 | 31.82 | 43.29 |
| 7 | Emu2 | BAAI | 51.50 | 48.83 | 22.17 | 45.40 | 34.69 | 38.44 | 37.54 | 24.81 | 45.00 | 34.91 | 39.70 |
| 8 | AnyEdit | ZJU | 47.67 | 45.17 | 0.00 | 39.26 | 38.56 | 42.94 | 41.88 | 36.56 | 26.92 | 31.74 | 38.55 |
| 9 | MagicBrush | OSU | 53.92 | 39.58 | 0.00 | 41.84 | 42.94 | 38.06 | 39.24 | 30.00 | 23.08 | 26.54 | 37.15 |
| 10 | OmniGen | BAAI | 37.92 | 28.25 | 21.83 | 33.11 | 30.63 | 27.19 | 28.02 | 11.94 | 35.83 | 23.89 | 28.85 |
| 11 | InstructPix2Pix | UCB | 30.33 | 21.33 | 0.00 | 23.33 | 22.56 | 26.56 | 25.59 | 19.81 | 14.75 | 17.28 | 22.82 |
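
Note that the Average and Overall columns are not unweighted means of the displayed dimension scores, which suggests some weighting, presumably by the number of tasks or instances per dimension. The minimal sketch below illustrates an instance-weighted aggregation from per-instance scores under that assumption; the record schema and weighting scheme are hypothetical, not the official scoring script.

```python
# Minimal sketch of instance-weighted aggregation from per-instance scores.
# Assumption: dimension, knowledge-type, and overall scores are means over
# instances, so dimensions with more instances weigh more in the averages.
from collections import defaultdict
from statistics import mean

def aggregate(records):
    """records: iterable of dicts with keys 'knowledge_type', 'dimension', 'score'."""
    by_dimension = defaultdict(list)
    by_type = defaultdict(list)
    all_scores = []
    for r in records:
        by_dimension[r["dimension"]].append(r["score"])
        by_type[r["knowledge_type"]].append(r["score"])
        all_scores.append(r["score"])
    return {
        "dimension_scores": {d: mean(s) for d, s in by_dimension.items()},
        "type_averages": {t: mean(s) for t, s in by_type.items()},  # instance-weighted
        "overall": mean(all_scores),
    }

# Example: two Factual instances and one Conceptual instance.
print(aggregate([
    {"knowledge_type": "Factual", "dimension": "Attribute Perception", "score": 83.0},
    {"knowledge_type": "Factual", "dimension": "Spatial Perception", "score": 79.0},
    {"knowledge_type": "Conceptual", "dimension": "Natural Science", "score": 80.0},
]))
```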

Task-Level Performance

Leaderboard Comparison

Note: If you would like to submit your results, please contact us at yongliang0223@gmail.com.

Benchmark Examples

KRIS-Bench Method

BibTeX

BibTeX code here