
When Clicking Is the Wrong Answer

Saheed · 2 min read

If the task is "sum column B in this spreadsheet," a GUI-only computer use agent will literally try to click through cells and add numbers. It will read values off the screen, keep a running total in its context, and probably make an error somewhere around row 30.

The right approach is three lines of Python. Open the file, compute the sum, return the result. It is faster, more reliable, and far harder to get wrong.
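As a rough illustration of the programmatic route, here is a minimal sketch using only the standard library. It assumes the spreadsheet has already been exported to CSV with a header row containing a column named "B"; the function name `sum_column` and the sample data are made up for this example. With a library such as pandas the same idea fits in about three lines.

```python
import csv
import io


def sum_column(csv_text: str, column: str) -> float:
    """Sum one named column of CSV text, skipping blank or missing cells."""
    total = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = (row.get(column) or "").strip()
        if cell:
            total += float(cell)
    return total


# Made-up sample data standing in for an exported spreadsheet.
sample = "A,B\nx,1.5\ny,2.5\nz,4\n"
print(sum_column(sample, "B"))  # 8.0
```

For a file on disk, read its text first (for example with `pathlib.Path(...).read_text()`) and pass it to the same function.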

The Limitation of GUI-Only Agents

Most computer use agents do not have this option. Their entire world is the screen: click, type, scroll. When they encounter a computational task, they solve it the only way they know how: through the GUI.

The Hybrid Approach: GUI Plus Code

A smarter architecture recognizes when a task is better solved programmatically and delegates it to a code execution environment. The agent opens the file in the application (GUI), exports the data (GUI), then processes it (code). The handoff between interaction and computation is seamless.
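One way to picture this handoff is a workflow in which each step is tagged with the mode that should run it, and a small runner dispatches to the matching backend while passing shared state along. The sketch below is only an assumption about how such a runner could look: `Mode`, `Step`, `run_gui`, `run_code`, and `run_workflow` are hypothetical names, and both backends are stubs rather than a real GUI driver or sandbox.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List


class Mode(Enum):
    GUI = "gui"    # click, type, scroll in the application
    CODE = "code"  # run a function in a code execution environment


@dataclass
class Step:
    name: str
    mode: Mode
    action: Callable[[Dict[str, Any]], Dict[str, Any]]


def run_gui(step: Step, state: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a real GUI driver: here it just calls the stub action.
    return step.action(state)


def run_code(step: Step, state: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a sandboxed interpreter: here it calls the function directly.
    return step.action(state)


BACKENDS = {Mode.GUI: run_gui, Mode.CODE: run_code}


def run_workflow(steps: List[Step]) -> Dict[str, Any]:
    """Run steps in order, carrying state across the GUI/code boundary."""
    state: Dict[str, Any] = {}
    trace: List[str] = []
    for step in steps:
        state.update(BACKENDS[step.mode](step, state))
        trace.append(f"{step.mode.value}:{step.name}")
    state["trace"] = trace
    return state


# The three steps from the paragraph above, with made-up exported values.
workflow = [
    Step("open file", Mode.GUI, lambda s: {"opened": True}),
    Step("export data", Mode.GUI, lambda s: {"exported": [3, 4, 5]}),
    Step("sum values", Mode.CODE, lambda s: {"total": sum(s["exported"])}),
]
result = run_workflow(workflow)
print(result["total"])  # 12
print(result["trace"])
```

The point of the shared state dictionary is that whatever the GUI steps export becomes plain data the code step can work on, so neither side needs to know how the other did its job.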

This matters more than you might think. Real-world RPA workflows frequently combine both. Navigate to a patient report in the EHR (GUI interaction). Export the report (GUI). Parse and extract structured data from the export (definitely code). Send the data to an API (code). Go back and confirm in the EHR that it was processed (GUI).
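The parsing step is where code pays off most clearly. As a hedged sketch, suppose the exported report is plain text with one "Key: Value" pair per line; that format, the field names, and the `parse_report` helper are all assumptions made for illustration, not a real EHR export format. The code turns the export into a dictionary and serializes it as JSON, ready to hand to whatever API client the workflow uses.

```python
import json


def parse_report(text: str) -> dict:
    """Parse 'Key: Value' lines into a dict with snake_case keys."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like "10:30" stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            record[key.strip().lower().replace(" ", "_")] = value.strip()
    return record


# Made-up export text; a real report would come from the GUI export step.
export_text = "Patient ID: 12345\nReport Type: Lab\nCollected At: 10:30\n"
payload = json.dumps(parse_report(export_text), sort_keys=True)
print(payload)
```

Lines without a colon are ignored, which keeps stray headers or blank lines in the export from breaking the parse.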

Trying to do the parsing step through the GUI would be slow, fragile, and error-prone. Trying to do the navigation through code would require brittle selectors that break on UI changes. The hybrid approach uses each tool for what it is good at.

The General Pattern

The pattern generalizes: any time your automation touches both user interfaces and data processing, you want an agent that can seamlessly switch between clicking and coding. Otherwise you are forcing a hammer to do the work of a screwdriver.
