KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

1Southeast University, 2Max Planck Institute for Informatics, 3Shanghai Jiao Tong University, 4StepFun, 5University of California, Berkeley, 6University of California, Merced
Corresponding author, Project leader

Abstract

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for editing tasks that require knowledge-based reasoning remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on nine state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.
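For readers who want the taxonomy at a glance, the following is a minimal Python sketch of how the three knowledge types map onto the seven reasoning dimensions reported in the leaderboard below. The grouping follows the leaderboard columns; the variable name KRIS_TAXONOMY is ours for illustration and the 22 concrete tasks under these dimensions are defined in the paper, not reproduced here.

# Illustrative sketch (not an official artifact of the benchmark) of the
# KRIS-Bench taxonomy: three foundational knowledge types, grouped into the
# seven reasoning dimensions that appear in the leaderboard below.
KRIS_TAXONOMY = {
    "Factual Knowledge": [
        "Attribute Perception",
        "Spatial Perception",
        "Temporal Perception",
    ],
    "Conceptual Knowledge": [
        "Social Science",
        "Natural Science",
    ],
    "Procedural Knowledge": [
        "Logical Reasoning",
        "Instruction Decomposition",
    ],
}

if __name__ == "__main__":
    for knowledge_type, dimensions in KRIS_TAXONOMY.items():
        print(f"{knowledge_type}: {', '.join(dimensions)}")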

KRIS-Bench Overview

Leaderboard

Columns are grouped by knowledge type: Factual Knowledge (Attribute, Spatial, and Temporal Perception, plus the category average), Conceptual Knowledge (Social Science, Natural Science, plus the category average), and Procedural Knowledge (Logical Reasoning, Instruction Decomposition, plus the category average), followed by the Overall Score.

| Model | Organization | Attribute Perception | Spatial Perception | Temporal Perception | Factual Avg. | Social Science | Natural Science | Conceptual Avg. | Logical Reasoning | Instruction Decomposition | Procedural Avg. | Overall Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | OpenAI | 83.17 | 79.08 | 68.25 | 79.80 | 85.50 | 80.06 | 81.37 | 71.56 | 85.08 | 78.32 | 80.09 |
| Gemini 2.0 | Google | 66.33 | 63.33 | 63.92 | 65.26 | 68.19 | 56.94 | 59.65 | 54.13 | 71.67 | 62.90 | 62.41 |
| Step 3o vision | StepFun | 69.67 | 61.08 | 63.25 | 66.70 | 66.88 | 60.88 | 62.32 | 49.06 | 54.92 | 51.99 | 61.43 |
| Doubao | ByteDance | 70.92 | 59.17 | 40.58 | 63.30 | 65.50 | 61.19 | 62.23 | 47.75 | 60.58 | 54.17 | 60.70 |
| BAGEL-Think | ByteDance | 67.42 | 68.33 | 58.67 | 66.18 | 63.55 | 61.40 | 61.92 | 48.12 | 50.22 | 49.02 | 60.18 |
| BAGEL | ByteDance | 64.27 | 62.42 | 42.45 | 60.26 | 55.40 | 56.01 | 55.86 | 52.54 | 50.56 | 51.69 | 56.21 |
| FLUX.1 Kontext [Max] | Black Forest Labs | 71.25 | 69.17 | 0.00 | 59.04 | 60.88 | 56.06 | 57.22 | 50.38 | 40.83 | 45.60 | 55.12 |
| FLUX.1 Kontext [Pro] | Black Forest Labs | 69.42 | 70.17 | 0.00 | 58.14 | 55.44 | 54.94 | 55.06 | 50.12 | 43.25 | 46.69 | 54.17 |
| Step1X-Edit v1.1 | StepFun | 64.17 | 61.75 | 0.00 | 53.05 | 52.06 | 55.06 | 54.34 | 52.56 | 36.75 | 44.66 | 51.59 |
| UniWorld-V1 | PKU | 58.17 | 54.50 | 63.00 | 47.71 | 47.50 | 43.94 | 44.80 | 42.00 | 53.83 | 47.92 | 50.27 |
| OmniGen2 | BAAI | 59.92 | 52.25 | 54.75 | 57.36 | 47.56 | 43.12 | 44.20 | 32.50 | 63.08 | 47.79 | 49.71 |
| FLUX.1 Kontext [Dev] | Black Forest Labs | 64.83 | 60.92 | 0.00 | 53.28 | 48.94 | 50.81 | 50.36 | 46.06 | 39.00 | 42.53 | 49.54 |
| ByteMorph | ByteDance | 61.17 | 62.00 | 0.00 | 51.27 | 45.50 | 47.38 | 46.92 | 32.00 | 31.33 | 31.67 | 44.85 |
| HiDream-E1 | HiDream.ai | 52.75 | 49.42 | 0.00 | 43.31 | 52.56 | 49.25 | 50.05 | 45.19 | 30.08 | 37.64 | 44.72 |
| Step1X-Edit | StepFun | 55.50 | 51.75 | 0.00 | 45.52 | 44.69 | 49.06 | 48.01 | 40.88 | 22.75 | 31.82 | 43.29 |
| Emu2 | BAAI | 51.50 | 48.83 | 22.17 | 45.40 | 34.69 | 38.44 | 37.54 | 24.81 | 45.00 | 34.91 | 39.70 |
| AnyEdit | ZJU | 47.67 | 45.17 | 0.00 | 39.26 | 38.56 | 42.94 | 41.88 | 36.56 | 26.92 | 31.74 | 38.55 |
| MagicBrush | OSU | 53.92 | 39.58 | 0.00 | 41.84 | 42.94 | 38.06 | 39.24 | 30.00 | 23.08 | 26.54 | 37.15 |
| OmniGen | BAAI | 37.92 | 28.25 | 21.83 | 33.11 | 30.63 | 27.19 | 28.02 | 11.94 | 35.83 | 23.89 | 28.85 |
| InstructPix2Pix | UCB | 30.33 | 21.33 | 0.00 | 23.33 | 22.56 | 26.56 | 25.59 | 19.81 | 14.75 | 17.28 | 22.82 |
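The category averages above are not simple means of their per-dimension columns (for example, GPT-4o's Factual average is 79.80 while the simple mean of its three Factual columns is about 76.8), which suggests instance-count weighting. The sketch below shows one way such a weighted average could be computed; the weighting scheme and the instance counts are illustrative assumptions, not the benchmark's published aggregation code.

# Hypothetical aggregation sketch: per-dimension scores combined into a
# category average, weighted by the number of benchmark instances in each
# dimension. Both the weighting scheme and the counts below are illustrative
# assumptions; KRIS-Bench's actual aggregation may differ.

def weighted_average(scores, counts):
    """Instance-count-weighted average over a set of reasoning dimensions."""
    total = sum(counts[dim] for dim in scores)
    return sum(scores[dim] * counts[dim] for dim in scores) / total

# Made-up instance counts, paired with GPT-4o's Factual Knowledge scores from the table.
counts = {"Attribute Perception": 300, "Spatial Perception": 150, "Temporal Perception": 100}
scores = {"Attribute Perception": 83.17, "Spatial Perception": 79.08, "Temporal Perception": 68.25}

print(f"Factual average under these illustrative weights: {weighted_average(scores, counts):.2f}")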

Note: If you would like to submit your results, please contact us at yongliang0223@gmail.com.

Benchmark Examples

KRIS-Bench Method
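The abstract describes a Knowledge Plausibility metric that is enhanced by per-instance knowledge hints. As a rough illustration only, the snippet below sketches how such a judge prompt might be assembled; the prompt wording, the 1-5 scale, the helper name build_knowledge_plausibility_prompt, and the example hint are our assumptions, not the paper's exact evaluation protocol.

# Minimal sketch of how a Knowledge Plausibility rating could be requested
# from a vision-language judge, assuming a 1-5 scale and a per-instance
# "knowledge hint" as described in the abstract. Prompt wording and scale
# are illustrative assumptions.

def build_knowledge_plausibility_prompt(instruction: str, knowledge_hint: str) -> str:
    return (
        "You are evaluating an instruction-based image edit.\n"
        f"Editing instruction: {instruction}\n"
        f"Relevant knowledge hint: {knowledge_hint}\n"
        "Given the source image and the edited image, rate on a scale of 1-5 "
        "how well the edit is consistent with the knowledge above. "
        "Respond with only the integer score."
    )

prompt = build_knowledge_plausibility_prompt(
    instruction="Show this ice cube after ten minutes at room temperature.",
    knowledge_hint="Ice melts into liquid water above 0 °C.",
)
print(prompt)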

BibTeX

@article{wu2025kris,
  title={KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models},
  author={Wu, Yongliang and Li, Zonghui and Hu, Xinting and Ye, Xinyu and Zeng, Xianfang and Yu, Gang and Zhu, Wenbo and Schiele, Bernt and Yang, Ming-Hsuan and Yang, Xu},
  journal={arXiv preprint arXiv:2505.16707},
  year={2025}
}