VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
We welcome contributions to the leaderboard!
Contact: zhoubeitong.zbt@antgroup.com
Dataset Composition
Domain distribution across the VenusBench-GD dataset
Performance Comparison
The performance of representative models on advanced grounding tasks is significantly lower than on basic tasks, highlighting the increased difficulty and reasoning demands of the advanced tasks.
Human performance vs. state-of-the-art (SOTA) models on grounding tasks. A significant performance gap persists, particularly in advanced grounding scenarios.
Experimental Results
Performance comparison on the VenusBench-GD dataset, categorized by evaluation task.
Dataset Visualization
Examples of basic grounding tasks, illustrating both correct and incorrect matches between generated instructions and their corresponding annotated bounding boxes.
Examples of advanced grounding tasks. In the refusal grounding task, the red bounding box indicates the original UI element. After modification of the instruction, no matching element exists in the image.
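To make the refusal case above concrete, here is a minimal sketch of how a single grounding prediction could be checked. It is an illustration under stated assumptions, not the benchmark's evaluation code. It assumes the model outputs either a click point or an explicit refusal (represented as `None`), that each annotation is an axis-aligned box `(x_min, y_min, x_max, y_max)`, and that a refusal sample has no target box. The names `is_correct` and `accuracy` are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Assumed formats (not taken from the benchmark):
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]               # (x, y) click location


def is_correct(pred: Optional[Point], target: Optional[Box]) -> bool:
    """Check one sample.

    pred is None   -> the model refused to ground the instruction.
    target is None -> no matching element exists (refusal sample).
    """
    if target is None:
        # Refusal sample: the only correct behaviour is to refuse.
        return pred is None
    if pred is None:
        # The element exists, so refusing is an error.
        return False
    x, y = pred
    x_min, y_min, x_max, y_max = target
    # Point-in-box test, boundaries inclusive.
    return x_min <= x <= x_max and y_min <= y <= y_max


def accuracy(samples: Iterable[Tuple[Optional[Point], Optional[Box]]]) -> float:
    """Fraction of samples judged correct; 0.0 for an empty input."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(is_correct(p, t) for p, t in samples) / len(samples)
```

Under these assumptions, a model that always predicts some point scores zero on every refusal sample, which is one way to see why such tasks are harder than basic grounding.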
BibTeX
@misc{zhou2025venusbenchgdcomprehensivemultiplatformgui,
title={VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks},
author={Beitong Zhou and Zhexiao Huang and Yuan Guo and Zhangxuan Gu and Tianyu Xia and Zichen Luo and Fei Tang and Dehan Kong and Yanyi Shang and Suling Ou and Zhenlin Guo and Changhua Meng and Shuheng Shen},
year={2025},
eprint={2512.16501},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16501},
}