
Split Reasoning from Grounding: Why Two Models Beat One

Saheed, 2 min read

Most computer use agents use a single model for everything: reasoning about the task, deciding the next action, and predicting where to click on screen. This is convenient but fundamentally a compromise. Reasoning and visual grounding are different skills with different error profiles.

Why One Model Is Not Enough

Think about it this way: deciding "I need to click the submit button in this dialog" is a language and reasoning task. Predicting "the submit button is at pixel coordinates (523, 847)" is a vision task. These require different types of intelligence.

When you ask one model to do both, it handles each adequately but neither exceptionally. The gap becomes obvious on cluttered UIs: settings panels with dozens of toggles, dialogs with multiple similar-looking buttons, dense data entry forms. These are exactly the kinds of interfaces you encounter in enterprise desktop software.

How We Split Reasoning from Grounding

We split the responsibilities. A general-purpose language model handles the reasoning: what action to take, why, and what element to target. A specialized vision model handles the grounding: given a natural language description of an element and a screenshot, predict the exact coordinates.

Each model does one job well instead of both jobs passably.
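As a rough illustration of the split, here is a minimal Python sketch. Everything in it is hypothetical: the names `PlannedAction`, `GroundedAction`, and `plan_and_ground`, the choice to pass the screenshot to both models, and the stub functions standing in for real model calls are assumptions made for the example, not a description of the actual system.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class PlannedAction:
    """Output of the reasoning model: what to do and which element to target."""
    kind: str       # e.g. "click"
    target: str     # natural language description of the element
    rationale: str  # why this action was chosen


@dataclass(frozen=True)
class GroundedAction:
    """A planned action plus the coordinates predicted by the grounding model."""
    kind: str
    target: str
    x: int
    y: int


# Hypothetical interfaces: each model is just a callable here.
Reasoner = Callable[[str, bytes], PlannedAction]       # (task, screenshot) -> plan
Grounder = Callable[[str, bytes], Tuple[int, int]]     # (description, screenshot) -> (x, y)


def plan_and_ground(task: str, screenshot: bytes,
                    reasoner: Reasoner, grounder: Grounder) -> GroundedAction:
    # Step 1: the language model decides the action and describes the target.
    planned = reasoner(task, screenshot)
    # Step 2: the vision model turns that description into coordinates.
    x, y = grounder(planned.target, screenshot)
    return GroundedAction(planned.kind, planned.target, x, y)


# Stubs standing in for real model calls, so the sketch runs on its own.
def stub_reasoner(task: str, screenshot: bytes) -> PlannedAction:
    return PlannedAction("click", "the submit button in this dialog",
                         "the form is complete")


def stub_grounder(description: str, screenshot: bytes) -> Tuple[int, int]:
    return (523, 847)


action = plan_and_ground("Submit the form", b"", stub_reasoner, stub_grounder)
print(action)
```

The only thing passed from one model to the other is the natural language description of the target, so either model can be swapped out without touching the other.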

The improvement in accuracy was the single biggest performance gain we have achieved. Not incrementally better. Meaningfully better. On complex enterprise UIs, the difference between "adequate" and "reliable" grounding is the difference between a demo and a production system.

One practical detail: the grounding model operates at a fixed internal resolution. We scale coordinates to the actual screen resolution at runtime. This means the system works on any display setup without retraining.
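The scaling step can be sketched in a few lines of Python. This assumes the screenshot is stretched to the model's resolution without letterboxing, so a plain proportional rescale is enough; padded or cropped inputs would also need an offset. The function name `scale_to_screen` and the 1000x1000 model resolution are hypothetical values chosen for the example.

```python
from typing import Tuple


def scale_to_screen(point: Tuple[float, float],
                    model_size: Tuple[int, int],
                    screen_size: Tuple[int, int]) -> Tuple[int, int]:
    """Map a point from the model's fixed resolution to screen pixels.

    Assumes the screenshot was resized to model_size by plain stretching.
    """
    mx, my = point
    model_w, model_h = model_size
    screen_w, screen_h = screen_size

    # Proportional rescale on each axis.
    x = round(mx * screen_w / model_w)
    y = round(my * screen_h / model_h)

    # Clamp so a prediction on the far edge still lands on a valid pixel.
    x = min(max(x, 0), screen_w - 1)
    y = min(max(y, 0), screen_h - 1)
    return (x, y)


# Hypothetical numbers: a 1000x1000 model space mapped to a 1920x1080 display.
scaled = scale_to_screen((500, 250), (1000, 1000), (1920, 1080))
print(scaled)  # (960, 270)
```

Because the mapping depends only on the two resolutions, the same grounding model can serve any display by changing `screen_size` at runtime.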

The Broader Lesson for Production AI

This applies beyond computer use agents. When one model is doing two fundamentally different things and performance is not where you need it, try splitting the work into specialized components before you try scaling up or fine-tuning. Specialization often beats generalization for production workloads.
