VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Zhou, Beitong; Huang, Zhexiao; Guo, Yuan; Gu, Zhangxuan; Xia, Tianyu; Luo, Zichen; Tang, Fei; Kong, Dehan; Shang, Yanyi; Ou, Suling; Guo, Zhenlin; Meng, Changhua; Shen, Shuheng

Beitong Zhou^*, Zhexiao Huang^*, Yuan Guo^*, Zhangxuan Gu^*, Tianyu Xia, Zichen Luo, Fei Tang, Dehan Kong, Yanyi Shang, Suling Ou, Zhenlin Guo, Changhua Meng, Shuheng Shen

Venus Team, Ant Group
arXiv 2025
^*Indicates Equal Contribution

We welcome contributions to the leaderboard!

Contact: zhoubeitong.zbt@antgroup.com

Overview of the VenusBench-GD benchmark. VenusBench-GD integrates basic and advanced grounding tasks to comprehensively evaluate the capabilities of existing GUI models. Basic tasks assess the ability to recognize local UI elements, while advanced tasks require holistic reasoning over the entire interface and its underlying application functionality, demanding a more complex and global understanding.

Dataset Composition

3 Platforms

10 Domains

97+ Applications

6100+ Sample Pairs

Domain distribution across the VenusBench dataset

Benchmark Statistics

The dataset statistics of VenusBench-GD reveal a diverse and challenging distribution across key dimensions. a) The image resolutions span a wide range, with a significant proportion concentrated in common screen sizes. b) UI element sizes vary substantially relative to the image area, covering a broad spectrum from very small to large elements. c) Meanwhile, instruction lengths exhibit a rich distribution, peaking in mid-length queries but extending to longer, more complex descriptions.
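As a rough illustration of statistic (b), the relative size of a UI element can be taken as the ratio of its bounding-box area to the image area. The sketch below assumes boxes are given as (x1, y1, x2, y2) pixel coordinates; the function name and the box format are illustrative assumptions and are not taken from the VenusBench-GD release.

```python
def relative_element_area(box, image_width, image_height):
    """Return the bounding-box area as a fraction of the image area.

    Assumption: `box` is (x1, y1, x2, y2) in pixels. The actual
    annotation format of the dataset may differ.
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    x1, y1, x2, y2 = box
    # Clamp degenerate boxes to zero width/height instead of going negative.
    box_width = max(0, x2 - x1)
    box_height = max(0, y2 - y1)
    return (box_width * box_height) / (image_width * image_height)
```

For example, a 192x108 pixel element on a 1920x1080 screenshot covers 1% of the image.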

Performance Comparison

The performance of representative models on advanced grounding tasks is significantly lower than on basic tasks, highlighting the increased difficulty and reasoning demands.

Human performance vs. state-of-the-art (SOTA) on grounding tasks. A significant performance gap persists, particularly in advanced grounding scenarios.

Experimental Results

Performance comparison on VenusBench-GD dataset categorized by the evaluation tasks.

Dataset Visualization

Examples of basic grounding tasks, illustrating both correct and incorrect matches between generated instructions and their corresponding annotated bounding boxes.

Examples of advanced grounding tasks. In the refusal grounding task, the red bounding box indicates the original UI element. After modification of the instruction, no matching element exists in the image.
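This page does not describe the scoring protocol, so the following is only a minimal sketch of how grounding predictions of this kind could be scored. It assumes the convention common in GUI grounding evaluation: a prediction is correct when the predicted click point falls inside the annotated bounding box. For refusal samples it assumes the target is recorded as `None` and that a model signals refusal by returning `None`. The names `is_correct` and `accuracy`, and both data formats, are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]              # (x, y) predicted click position
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) annotated element


def is_correct(pred: Optional[Point], target: Optional[Box]) -> bool:
    """Score one sample under the assumed point-in-box convention."""
    if target is None:
        # Refusal sample: no matching element exists, so only a refusal counts.
        return pred is None
    if pred is None:
        # The model refused although a matching element exists.
        return False
    x, y = pred
    x1, y1, x2, y2 = target
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy(preds: Sequence[Optional[Point]],
             targets: Sequence[Optional[Box]]) -> float:
    """Fraction of samples scored as correct."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(is_correct(p, t) for p, t in zip(preds, targets))
    return hits / len(preds)
```

Treating refusal as a `None` prediction keeps basic and refusal samples in one scoring loop; an evaluation that reports per-task numbers would group samples by task before calling `accuracy`.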

BibTeX

@misc{zhou2025venusbenchgdcomprehensivemultiplatformgui,
      title={VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks}, 
      author={Beitong Zhou and Zhexiao Huang and Yuan Guo and Zhangxuan Gu and Tianyu Xia and Zichen Luo and Fei Tang and Dehan Kong and Yanyi Shang and Suling Ou and Zhenlin Guo and Changhua Meng and Shuheng Shen},
      year={2025},
      eprint={2512.16501},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.16501}, 
}

More Works from Our Team

UI-Venus Technical Report: Building High-performance UI Agents with RFT
