Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

¹University of Hong Kong  ²Salesforce Research
*Equal contribution  ^Corresponding authors

Abstract

AGUVIS is a unified, purely vision-based framework for autonomous GUI agents that operate across platforms (web, desktop, and mobile). Unlike previous approaches that rely on textual representations such as HTML or accessibility trees, AGUVIS uses image-based observations together with a consistent, cross-platform action space, enabling better generalization across platforms.

Key Features & Contributions

  • 🔍 Pure Vision Framework: First fully autonomous pure vision GUI agent capable of performing tasks independently without relying on closed-source models
  • 🔄 Cross-Platform Unification: Unified action space and plugin system that works consistently across different GUI environments (see the sketch after this list)
  • 📊 Comprehensive Dataset: Large-scale dataset of GUI agent trajectories with multimodal grounding and reasoning
  • 🧠 Two-Stage Training: Novel training pipeline focusing on GUI grounding followed by planning and reasoning
  • 💭 Inner Monologue: Explicit planning and reasoning capabilities integrated into the model training
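
To make the cross-platform action space concrete, below is a minimal Python sketch of a shared action core with per-platform plugin actions, assuming a pyautogui-like command vocabulary. The class and function names are illustrative stand-ins, not the actual AGUVIS interface, and execution is stubbed with print statements.

# Minimal sketch of a unified action space with platform plugins.
# Names and signatures are illustrative assumptions, not the AGUVIS API.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Action:
    name: str     # e.g. "click", "write", "swipe"
    kwargs: dict  # e.g. {"x": 0.51, "y": 0.23} in normalized screen coordinates

class ActionSpace:
    """Core actions shared by web, desktop, and mobile, plus plugin actions."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., None]] = {}

    def register(self, name: str, handler: Callable[..., None]) -> None:
        self._handlers[name] = handler

    def execute(self, action: Action) -> None:
        if action.name not in self._handlers:
            raise ValueError(f"Unsupported action: {action.name}")
        self._handlers[action.name](**action.kwargs)

space = ActionSpace()
# Shared core actions (stubbed with prints instead of real pyautogui calls).
space.register("click", lambda x, y: print(f"click at ({x:.2f}, {y:.2f})"))
space.register("write", lambda text: print(f"type {text!r}"))
# A mobile-only plugin action layered on top of the shared core.
space.register("swipe", lambda x0, y0, x1, y1: print(f"swipe ({x0},{y0}) -> ({x1},{y1})"))

space.execute(Action("click", {"x": 0.51, "y": 0.23}))
space.execute(Action("write", {"text": "report.pdf"}))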

Our framework demonstrates state-of-the-art performance in both offline and real-world online scenarios, offering a more efficient and generalizable approach to GUI automation.

🔥 AGUVIS on OSWorld

Success rate (%) on OSWorld, broken down by application: OS, LibreOffice Calc, Impress, Writer, VLC, Thunderbird (TB), Chrome, VS Code (VSC), GIMP, and multi-app workflows (WF).

| Planner    | Grounder         | OS    | Calc  | Impress | Writer | VLC   | TB    | Chrome | VSC   | GIMP  | WF    | Avg.  |
|------------|------------------|-------|-------|---------|--------|-------|-------|--------|-------|-------|-------|-------|
| GPT-4o     | GPT-4o           | 8.33  | 0.00  | 6.77    | 4.35   | 16.10 | 0.00  | 4.35   | 4.35  | 3.85  | 5.58  | 5.03  |
| GPT-4o     | SoM              | 20.83 | 0.00  | 6.77    | 4.35   | 6.53  | 0.00  | 4.35   | 4.35  | 0.00  | 3.60  | 4.59  |
| GPT-4o     | SeeClick         | 16.67 | 0.00  | 12.76   | 4.35   | 23.52 | 6.67  | 10.86  | 8.70  | 11.54 | 7.92  | 9.21  |
| GPT-4o     | OS-Atlas-Base-4B | 20.83 | 2.23  | 14.89   | 8.70   | 23.52 | 13.33 | 15.22  | 13.04 | 15.38 | 7.92  | 11.65 |
| GPT-4o     | OS-Atlas-Base-7B | 25.00 | 4.26  | 17.02   | 8.70   | 29.41 | 26.67 | 19.57  | 17.39 | 19.23 | 8.91  | 14.63 |
| GPT-4o     | AGUVIS-7B        | 41.67 | 4.26  | 8.51    | 17.38  | 17.65 | 26.67 | 17.23  | 17.39 | 34.62 | 5.58  | 14.79 |
| AGUVIS-72B | AGUVIS-72B       | 20.83 | 4.26  | 11.03   | 13.04  | 12.41 | 20.00 | 15.06  | 17.39 | 11.54 | 3.60  | 10.26 |
| Human      | –                | 75.00 | 61.70 | 80.85   | 73.91  | 70.59 | 46.67 | 78.26  | 73.91 | 73.08 | 73.27 | 72.36 |

Training Pipeline
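
As a rough, hypothetical illustration of what the two training stages consume, the records below sketch a stage-1 grounding example (a low-level instruction on a screenshot mapped to a grounded click) and a stage-2 planning-and-reasoning example (an explicit inner monologue preceding the executable action). The field names, file names, and coordinates are assumptions made for illustration, not the released dataset's actual schema.

# Hypothetical training records for the two stages (illustrative only).

# Stage 1 -- GUI grounding: low-level instruction -> grounded action.
stage1_example = {
    "image": "screenshot_000.png",
    "instruction": "Click the 'Export' button in the toolbar",
    "target": "pyautogui.click(x=0.32, y=0.47)",  # normalized coordinates
}

# Stage 2 -- planning and reasoning: the target interleaves an inner
# monologue (observation and thought) with the next executable action.
stage2_example = {
    "image": "screenshot_017.png",
    "goal": "Export the current document into PDF, keep the file name",
    "target": (
        "Observation: The document is open and the File menu is visible.\n"
        "Thought: Open the File menu to reach the PDF export option.\n"
        "Action: pyautogui.click(x=0.03, y=0.05)"
    ),
}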

Explore AGUVIS Examples

Task Instruction

Export the current document into PDF, keep the file name.

Video Recording

Offline Experiments

ScreenSpot

Comparison of different planners and grounding methods on ScreenSpot across device types and input modalities. The top part of the table reports results under the original-instruction evaluation setting, while the bottom part reports results under the self-plan setting. Best results are in bold.

Multimodal-Mind2Web

Performance comparison on Multimodal-Mind2Web across different settings. We report element accuracy (Ele.Acc), operation F1 (Op.F1), and step success rate (Step SR). Best results are in bold. "T" denotes textual HTML input; "I" denotes GUI screenshot (image) input.
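
For readers unfamiliar with these metrics, here is a small sketch of how element accuracy and step success rate can be computed from per-step predictions, following the usual Mind2Web convention that a step succeeds only when both the selected element and the operation are correct. This is an illustrative simplification (Op.F1, a token-level F1 over the predicted operation, is omitted), not code from the AGUVIS release.

# Illustrative metric computation for Mind2Web-style step evaluation.
def element_accuracy(steps):
    return sum(s["pred_elem"] == s["gold_elem"] for s in steps) / len(steps)

def step_success_rate(steps):
    # A step is successful only if both element and operation are correct.
    return sum(
        s["pred_elem"] == s["gold_elem"] and s["pred_op"] == s["gold_op"]
        for s in steps
    ) / len(steps)

steps = [
    {"pred_elem": "btn_42", "gold_elem": "btn_42", "pred_op": "CLICK", "gold_op": "CLICK"},
    {"pred_elem": "input_7", "gold_elem": "input_9", "pred_op": "TYPE hello", "gold_op": "TYPE hello"},
]
print(element_accuracy(steps), step_success_rate(steps))  # 0.5 0.5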

AndroidControl

Step accuracy on out-of-domain (OOD) AndroidControl data for both high-level and low-level tasks. Best performance is in bold. "Acc.Tree" denotes the textual accessibility tree input.

Online Experiments


Mind2Web-Live

Task Success Rate (SR) and efficiency costs on Mind2Web-Live. USD efficiency is computed by dividing the model's total inference cost in USD by the number of successful steps.
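
As a trivial worked example of this efficiency metric (all numbers below are made up for illustration):

# USD efficiency = total inference cost (USD) / number of successful steps.
total_cost_usd = 4.80    # hypothetical total API spend for an evaluation run
successful_steps = 96    # hypothetical number of successful steps
print(f"{total_cost_usd / successful_steps:.3f} USD per successful step")  # 0.050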


AndroidWorld

Task Success Rate (SR) on AndroidWorld and MobileMiniWoB++. Best results are in bold.

BibTeX

@misc{xu2024aguvis,
  title={Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction},
  author={Yiheng Xu and Zekun Wang and Junli Wang and Dunjie Lu and Tianbao Xie and Amrita Saha and Doyen Sahoo and Tao Yu and Caiming Xiong},
  year={2024},
  eprint={2412.04454},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}