Writing RPA scripts by hand is tedious. You click somewhere, note the coordinates, write the code, test it, realize you got the timing wrong, adjust, test again. It's slow, error-prone, and by the time you're done, you've forgotten half the steps.
There's a better way: record your actions first, then convert them to code.
This approach lets you perform the task manually once while capturing every click and keystroke. Then you label what you did, feed it to an LLM, and get a working Python script. It's faster, more accurate, and gives you a reference for what the automation should actually do.
Here's how to build this workflow.
When you're building desktop automation, you need to know the exact coordinates of every click, how long to wait between actions, and the precise order of the steps.
Trying to figure this all out while writing code is like trying to write directions to somewhere you've never been. You're guessing at coordinates, guessing at timing, and testing constantly.
Recording first, then coding, flips this around. You perform the task once while the computer watches, then you review what happened and turn it into automation.
The core of this approach is recording your mouse and keyboard activity. You need a lightweight keylogger/action recorder that captures mouse clicks (coordinates and button), keystrokes, and a timestamp for each action.
Here's a simple recorder using pynput:
from pynput import mouse, keyboard
from datetime import datetime
import json


class ActionRecorder:
    def __init__(self, output_file="recording.json"):
        self.output_file = output_file
        self.actions = []
        self.recording = False
        self.start_time = None

    def on_click(self, x, y, button, pressed):
        """Record mouse clicks"""
        if not self.recording or not pressed:
            return
        elapsed = (datetime.now() - self.start_time).total_seconds()
        self.actions.append({
            'type': 'click',
            'x': x,
            'y': y,
            'button': str(button),
            'timestamp': elapsed,
            'label': ''  # Will be filled in later
        })
        print(f"[{elapsed:.2f}s] Click at ({x}, {y}) with {button}")

    def on_press(self, key):
        """Record keyboard presses"""
        # F9 toggles recording on/off (checked first, so the F9
        # press itself is never recorded)
        if key == keyboard.Key.f9:
            if self.recording:
                self.stop_recording()
            else:
                self.start_recording()
            return
        if not self.recording:
            return
        elapsed = (datetime.now() - self.start_time).total_seconds()
        try:
            key_char = key.char
        except AttributeError:
            key_char = str(key)
        self.actions.append({
            'type': 'keypress',
            'key': key_char,
            'timestamp': elapsed,
            'label': ''
        })
        print(f"[{elapsed:.2f}s] Key press: {key_char}")

    def start_recording(self):
        """Start recording actions"""
        print("=" * 50)
        print("Recording started! Press F9 to stop.")
        print("=" * 50)
        self.recording = True
        self.start_time = datetime.now()
        self.actions = []

    def stop_recording(self):
        """Stop recording and save"""
        print("=" * 50)
        print("Recording stopped!")
        print(f"Captured {len(self.actions)} actions")
        print("=" * 50)
        self.recording = False
        self.save_actions()

    def save_actions(self):
        """Save recorded actions to JSON"""
        with open(self.output_file, 'w') as f:
            json.dump(self.actions, f, indent=2)
        print(f"Actions saved to {self.output_file}")

    def run(self):
        """Start the recorder"""
        print("Action Recorder Ready")
        print("Press F9 to start/stop recording")
        # Set up listeners
        mouse_listener = mouse.Listener(on_click=self.on_click)
        keyboard_listener = keyboard.Listener(on_press=self.on_press)
        mouse_listener.start()
        keyboard_listener.start()
        # Keep running
        keyboard_listener.join()


if __name__ == '__main__':
    recorder = ActionRecorder('my_recording.json')
    recorder.run()
Install dependencies:
pip install pynput
Usage:
python recorder.py
This creates my_recording.json. After recording a simple task (opening Notepad and typing something), your JSON looks like:
[
  {
    "type": "keypress",
    "key": "Key.cmd",
    "timestamp": 0.5,
    "label": ""
  },
  {
    "type": "keypress",
    "key": "n",
    "timestamp": 0.6,
    "label": ""
  },
  {
    "type": "keypress",
    "key": "o",
    "timestamp": 0.7,
    "label": ""
  },
  {
    "type": "keypress",
    "key": "t",
    "timestamp": 0.8,
    "label": ""
  },
  {
    "type": "click",
    "x": 450,
    "y": 320,
    "button": "Button.left",
    "timestamp": 3.2,
    "label": ""
  },
  {
    "type": "keypress",
    "key": "H",
    "timestamp": 4.1,
    "label": ""
  },
  {
    "type": "keypress",
    "key": "e",
    "timestamp": 4.2,
    "label": ""
  }
]
This is raw data, and not very useful yet. The next two steps turn it into something meaningful.
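Before moving on, it can help to sanity-check what you captured. A quick sketch (summarize_recording is a hypothetical helper, not part of the recorder above; it assumes the JSON shape shown):

```python
from collections import Counter

def summarize_recording(actions):
    """Count actions by type and report the total recording duration."""
    counts = Counter(a['type'] for a in actions)
    duration = actions[-1]['timestamp'] if actions else 0.0
    return dict(counts), duration

# A tiny in-memory recording for demonstration
sample = [
    {'type': 'keypress', 'key': 'n', 'timestamp': 0.6, 'label': ''},
    {'type': 'click', 'x': 450, 'y': 320, 'button': 'Button.left',
     'timestamp': 3.2, 'label': ''},
]
print(summarize_recording(sample))  # ({'keypress': 1, 'click': 1}, 3.2)
```

If the counts look wrong (say, zero clicks in a click-heavy task), re-record before wasting time labeling.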
Raw coordinates and keystrokes are meaningless without context. You need to review the recording and label what each action represents.
This is the thought loop: replay each action in your head, ask what you were actually doing at that moment, and write it down as the label.
Build a simple script to help you add labels:
import json


def label_recording(recording_file):
    """Interactive labeling of recorded actions"""
    with open(recording_file, 'r') as f:
        actions = json.load(f)

    print(f"Loaded {len(actions)} actions from {recording_file}")
    print("=" * 60)
    print("Add labels to describe what each action does")
    print("Press Enter to skip an action")
    print("=" * 60)

    for idx, action in enumerate(actions):
        print(f"\nAction {idx + 1}/{len(actions)}:")
        print(f"  Type: {action['type']}")
        print(f"  Time: {action['timestamp']:.2f}s")
        if action['type'] == 'click':
            print(f"  Position: ({action['x']}, {action['y']})")
            print(f"  Button: {action['button']}")
        elif action['type'] == 'keypress':
            print(f"  Key: {action['key']}")

        # Get label from user
        label = input("  Label (describe what this does): ").strip()
        if label:
            action['label'] = label
        else:
            action['label'] = f"Step {idx + 1}"

    # Save labeled recording
    output_file = recording_file.replace('.json', '_labeled.json')
    with open(output_file, 'w') as f:
        json.dump(actions, f, indent=2)

    print(f"\n✓ Labeled recording saved to {output_file}")
    return output_file


if __name__ == '__main__':
    labeled_file = label_recording('my_recording.json')
Example labeling session:
Action 1/7:
  Type: keypress
  Key: Key.cmd
  Time: 0.50s
  Label: Open Windows start menu

Action 2/7:
  Type: keypress
  Key: n
  Time: 0.60s
  Label: (skip - part of typing "notepad")

Action 3/7:
  Type: keypress
  Key: o
  Time: 0.70s
  Label: (skip)

Action 4/7:
  Type: keypress
  Key: t
  Time: 0.80s
  Label: (skip)

Action 5/7:
  Type: click
  Position: (450, 320)
  Button: Button.left
  Time: 3.20s
  Label: Click in notepad text area

Action 6/7:
  Type: keypress
  Key: H
  Time: 4.10s
  Label: Type "Hello"
After labeling, your JSON becomes:
[
  {
    "type": "keypress",
    "key": "Key.cmd",
    "timestamp": 0.5,
    "label": "Open Windows start menu"
  },
  {
    "type": "keypress",
    "key": "n",
    "timestamp": 0.6,
    "label": "Type 'notepad' to search"
  },
  {
    "type": "click",
    "x": 450,
    "y": 320,
    "button": "Button.left",
    "timestamp": 3.2,
    "label": "Click in notepad text area"
  },
  {
    "type": "keypress",
    "key": "H",
    "timestamp": 4.1,
    "label": "Type 'Hello' message"
  }
]
Now you have semantic meaning attached to low-level actions.
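One optional cleanup pass pays off before you hand the recording to an LLM: collapsing runs of single-character keypresses into one "type" action, so the model sees "type 'not'" instead of three separate events. A sketch, assuming the JSON shape shown above (collapse_keypresses and the 'type' action kind are my own invention, not part of pynput):

```python
def collapse_keypresses(actions, max_gap=1.0):
    """Merge runs of single-character keypresses into one 'type' action.

    Consecutive printable keys pressed within max_gap seconds of each
    other almost always belong to one typed string.
    """
    merged = []
    for action in actions:
        prev = merged[-1] if merged else None
        is_char = action['type'] == 'keypress' and len(action.get('key', '')) == 1
        if (prev is not None and prev['type'] == 'type' and is_char
                and action['timestamp'] - prev['end'] <= max_gap):
            prev['text'] += action['key']       # extend the current run
            prev['end'] = action['timestamp']
        elif is_char:
            merged.append({'type': 'type', 'text': action['key'],
                           'timestamp': action['timestamp'],
                           'end': action['timestamp'],
                           'label': action.get('label', '')})
        else:
            merged.append(dict(action))          # clicks, special keys, etc.
    return merged

raw = [
    {'type': 'keypress', 'key': 'n', 'timestamp': 0.6, 'label': ''},
    {'type': 'keypress', 'key': 'o', 'timestamp': 0.7, 'label': ''},
    {'type': 'keypress', 'key': 't', 'timestamp': 0.8, 'label': ''},
    {'type': 'click', 'x': 450, 'y': 320, 'button': 'Button.left',
     'timestamp': 3.2, 'label': ''},
]
print(collapse_keypresses(raw))  # one 'type' action with text 'not', then the click
```

Special keys like Key.cmd pass through untouched, since their string form is longer than one character.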
This is where the magic happens. Feed your labeled recording to an LLM and get a Python script.
import anthropic
import json

def generate_rpa_script(labeled_recording_file, output_script_file="automation.py"):
    """Use Claude to convert labeled recording to Python script"""
    # Load labeled recording
    with open(labeled_recording_file, 'r') as f:
        actions = json.load(f)

    # Create prompt
    prompt = f"""
I have a recording of desktop actions that I want to convert into a Python RPA script using pyautogui.

Here are the recorded actions with labels describing what each does:

{json.dumps(actions, indent=2)}

Please create a Python script that:
1. Uses pyautogui for automation
2. Groups related actions into well-named functions
3. Adds appropriate time.sleep() delays between actions
4. Includes error handling
5. Has clear comments explaining each step
6. Starts and ends in a clean state (closes any windows opened)

Important notes:
- Prefer keyboard shortcuts over coordinate-based clicks when possible
- Add logging to track progress
- Make functions reusable
- Include a main() function that orchestrates everything

Generate a complete, working Python script.
"""

    # Call Claude
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    # Extract Python code from response
    script_content = message.content[0].text

    # Save to file
    with open(output_script_file, 'w') as f:
        f.write(script_content)

    print(f"✓ Python script generated: {output_script_file}")
    return output_script_file

# Usage
generate_rpa_script('my_recording_labeled.json', 'notepad_automation.py')
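One wrinkle: the API response is plain text and often wraps the script in markdown code fences, so saving message.content[0].text verbatim can produce a file that won't run. A minimal sketch for stripping the fences before writing the file (extract_code is a hypothetical helper):

```python
import re

FENCE = "`" * 3  # three literal backticks, built here so this example stays readable

def extract_code(response_text):
    """Return the body of the first fenced python code block,
    or the text unchanged if there are no fences."""
    pattern = FENCE + r"(?:python)?\n(.*?)" + FENCE
    match = re.search(pattern, response_text, re.DOTALL)
    return match.group(1) if match else response_text

reply = "Here is your script:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nEnjoy!"
print(extract_code(reply))  # just the inner print('hi') line
```

Call extract_code(script_content) before the f.write() step and the saved file is pure Python.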
Claude might generate something like:
import pyautogui
import time
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')

def open_notepad():
    """Open Notepad application"""
    logging.info("Opening Notepad")
    # Press Windows key
    pyautogui.press('win')
    time.sleep(0.5)
    # Type 'notepad' to search
    pyautogui.write('notepad')
    time.sleep(0.5)
    # Press Enter to launch
    pyautogui.press('enter')
    time.sleep(2)  # Wait for Notepad to open
    logging.info("Notepad opened")

def type_message(message):
    """Type a message in the active text area"""
    logging.info(f"Typing message: {message}")
    # Click in text area to ensure focus (using recorded coordinates)
    pyautogui.click(450, 320)
    time.sleep(0.3)
    # Type the message
    pyautogui.write(message)
    time.sleep(0.5)
    logging.info("Message typed")

def close_notepad():
    """Close Notepad without saving"""
    logging.info("Closing Notepad")
    # Alt+F4 to close
    pyautogui.hotkey('alt', 'f4')
    time.sleep(0.5)
    # Don't save (press 'n' for No)
    pyautogui.press('n')
    time.sleep(0.5)
    logging.info("Notepad closed")

def close_all_notepads():
    """Ensure no Notepad windows are open"""
    # Implementation to close any existing Notepad windows
    pass

def main():
    """Main automation workflow"""
    try:
        logging.info("Starting automation")
        # Ensure we start in a clean state
        close_all_notepads()  # Helper to close any existing instances
        # Perform the automation
        open_notepad()
        type_message("Hello from RPA!")
        # End in a clean state
        close_notepad()
        logging.info("Automation completed successfully")
    except Exception as e:
        logging.error(f"Automation failed: {e}")
        # Try to clean up
        close_notepad()
        raise

if __name__ == '__main__':
    main()
What the LLM did: it grouped the raw actions into named functions (open_notepad, type_message, close_notepad), swapped the recorded start-menu clicks for keyboard shortcuts, added time.sleep() delays and logging, and wrapped everything in a main() that cleans up even when a step fails.
Once you have your generated script, break it into a Jupyter notebook for easier debugging:
automation.ipynb:
# Cell 1: Setup
import pyautogui
import time
import logging

logging.basicConfig(level=logging.INFO)

# Cell 2: Open Notepad
def open_notepad():
    logging.info("Opening Notepad")
    pyautogui.press('win')
    time.sleep(0.5)
    pyautogui.write('notepad')
    time.sleep(0.5)
    pyautogui.press('enter')
    time.sleep(2)

# Test this step
open_notepad()

# Cell 3: Type Message
def type_message(message):
    logging.info(f"Typing: {message}")
    pyautogui.click(450, 320)
    time.sleep(0.3)
    pyautogui.write(message)
    time.sleep(0.5)

# Test this step
type_message("Testing!")

# Cell 4: Close Notepad
def close_notepad():
    logging.info("Closing Notepad")
    pyautogui.hotkey('alt', 'f4')
    time.sleep(0.5)
    pyautogui.press('n')
    time.sleep(0.5)

# Test this step
close_notepad()

# Cell 5: Full Workflow
def main():
    open_notepad()
    type_message("Full automation test")
    close_notepad()

main()
Why Jupyter notebooks are perfect for RPA:
Step-by-step execution: Run one cell at a time to test each function individually
Easy replays: If step 3 fails, just re-run that cell without starting over
Visual feedback: See output and logs inline as you develop
Quick iterations: Modify a function and immediately re-test it
Documentation: Add markdown cells explaining what each step does
When debugging, you can run a single cell, tweak a delay or coordinate, and immediately re-run just that step until it behaves. That's far faster than running the entire script from scratch after every change.
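Before running any generated pyautogui code in a notebook, it's also worth enabling pyautogui's built-in safety settings in your setup cell, so a runaway automation can't take over your machine:

```python
import pyautogui

# Slamming the mouse into the top-left screen corner aborts the run:
# pyautogui raises FailSafeException when the pointer reaches (0, 0).
pyautogui.FAILSAFE = True

# Insert a short pause after every pyautogui call so you can watch
# the automation and reach the corner in time if it misbehaves.
pyautogui.PAUSE = 0.25
```

Both settings are module-level globals, so one setup cell covers every function you test afterward.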
Every automation should start and end in a clean state. The LLM should generate code that follows this principle, but double-check:
Bad approach:
def automate_task():
    open_application()
    enter_data()
    click_submit()
    # Done! (but the application is still open)
Good approach:
def automate_task():
    # Clean state at start
    close_all_application_windows()
    time.sleep(1)
    try:
        open_application()
        enter_data()
        click_submit()
    finally:
        # Always clean up, even if something fails
        close_all_application_windows()
        time.sleep(1)
Why this matters: reruns become deterministic, a failed run doesn't leave stray windows behind, and the next automation always starts from the same screen state.
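The start-clean/end-clean pattern can also be enforced mechanically with a small context manager, so no individual function can forget the teardown. A sketch (clean_state is my own name, not a library API):

```python
import contextlib

@contextlib.contextmanager
def clean_state(setup, teardown):
    """Run teardown before and after the automation, even on failure."""
    teardown()   # start clean: close leftovers from a previous run
    setup()
    try:
        yield
    finally:
        teardown()   # end clean, whether the body succeeded or raised

# Hypothetical usage with the Notepad helpers from the generated script:
# with clean_state(open_notepad, close_all_notepads):
#     type_message("Hello from RPA!")
```

Because the teardown sits in a finally block, a crash mid-automation still leaves the desktop in a known state.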
Enhance your recorder to capture which window was active:
import win32gui

def get_active_window():
    """Get the title of the active window"""
    try:
        window = win32gui.GetForegroundWindow()
        return win32gui.GetWindowText(window)
    except Exception:
        return "Unknown"

# In your recorder
action['window'] = get_active_window()
This helps the LLM understand context: "User clicked in the 'Customer Entry' window at (450, 320)" is more useful than just coordinates.
Capture how long you naturally waited between actions:
# In your actions list
if len(self.actions) > 0:
    last_timestamp = self.actions[-1]['timestamp']
    wait_time = elapsed - last_timestamp
    # If you waited more than 2 seconds, it's probably intentional
    if wait_time > 2.0:
        self.actions.append({
            'type': 'wait',
            'duration': wait_time,
            'timestamp': elapsed,
            'label': 'Waiting for application to load'
        })
Save a screenshot with each click to help with labeling later:
import pyautogui
from pathlib import Path

def on_click(self, x, y, button, pressed):
    if not self.recording or not pressed:
        return

    # Take screenshot (create the screenshots/ folder if needed)
    Path("screenshots").mkdir(exist_ok=True)
    screenshot = pyautogui.screenshot()
    screenshot_path = f"screenshots/click_{len(self.actions)}.png"
    screenshot.save(screenshot_path)

    # Add to action
    self.actions.append({
        'type': 'click',
        'x': x,
        'y': y,
        'button': str(button),
        'screenshot': screenshot_path,
        'label': ''
    })
When labeling, you can see what was on screen when you clicked.
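The full screenshot is usually overkill for labeling; a crop around the click point is much easier to scan. A sketch of the box math (crop_box is a hypothetical helper; the clamping keeps the box on-screen for clicks near an edge):

```python
def crop_box(width, height, x, y, radius=100):
    """Box centred on the click, clamped so it stays inside the screen."""
    left = max(0, x - radius)
    top = max(0, y - radius)
    right = min(width, x + radius)
    bottom = min(height, y + radius)
    return (left, top, right, bottom)

# pyautogui screenshots are PIL images, so this plugs straight into .crop():
# screenshot.crop(crop_box(screenshot.width, screenshot.height, x, y))
print(crop_box(1920, 1080, 450, 320))  # (350, 220, 550, 420)
```

Save the crop alongside (or instead of) the full screenshot and the labeling pass gets noticeably faster.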
Let's put it all together with a realistic example: automating data entry in an old CRM.
Step 1: Record the task
python recorder.py
# Press F9, perform your task, press F9 again
Step 2: Label the recording
python labeler.py my_recording.json
You label each action along the lines of "Open the CRM from the desktop shortcut", "Click the customer name field", "Type the customer's name and address", "Click Save".
Step 3: Generate Python script
python generate_script.py my_recording_labeled.json
Claude produces a script with functions like:
open_crm()
enter_customer_data(name, address)
save_customer()
close_crm()
Step 4: Create Jupyter notebook
Copy the generated script into cells, one function per cell.
Step 5: Debug step-by-step
Run each cell individually, adjusting timing and coordinates as needed.
Step 6: Export final script
Once everything works, export as a standalone Python file.
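Jupyter can do that export for you; a one-liner, assuming the notebook is saved as automation.ipynb:

```shell
# Convert the notebook's code cells into a plain .py file
jupyter nbconvert --to script automation.ipynb
# writes automation.py next to the notebook
```

You may want to delete the "# Test this step" calls from the exported file, since they run every function twice.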
Step 7: Add to your automation library
Now you have a reusable script for CRM data entry.
Keep it simple: Record one workflow at a time. Don't try to capture everything in one recording.
Go slow: When recording, perform actions slightly slower than normal. This gives better timing data.
Minimize mouse clicks: If you can navigate with keyboard, do it. The recording will translate better to code.
Pause between steps: Give the LLM clear boundaries between logical steps.
Record multiple times: If the first recording is messy, just record again. It's quick.
Label immediately: Don't wait days to label. Do it right after recording while fresh in your mind.
Problem: Generated script has wrong coordinates
Solution: Your display scaling might be off. Check Windows display settings and re-record, or manually adjust coordinates in the generated script.
Problem: Script runs too fast
Solution: Increase delays in Jupyter, find the right timing, then update the script.
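One way to make that tuning painless is a single global speed knob instead of editing every sleep call. A small sketch (SPEED and pause are my own names, not pyautogui features):

```python
import time

SPEED = 1.0  # raise to slow everything down while debugging, e.g. 3.0

def pause(seconds):
    """Drop-in replacement for time.sleep() with one global speed knob."""
    time.sleep(seconds * SPEED)

# In the generated script, replace time.sleep(0.5) with pause(0.5);
# then a single change to SPEED retunes the whole run.
```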
Problem: Actions grouped incorrectly
Solution: Add more descriptive labels. Instead of "Click button", say "Click the Save button in the customer dialog".
Problem: LLM generates overly complex code
Solution: Simplify your prompt. Ask for "simple, readable code" and specify you want basic pyautogui commands.
Recording keystrokes can capture sensitive information like passwords. Best practices: never type real credentials while recording, review the JSON before sharing it with anyone (or any API), and delete recordings you no longer need.
This is for automating your own work, not monitoring others.
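If a recording must contain a sensitive step, scrub it before the JSON leaves your machine. A sketch, assuming the labeled format shown earlier (redact is a hypothetical helper keyed off the labels you wrote):

```python
def redact(actions, sensitive_words=('password', 'pin', 'secret')):
    """Return a copy of the recording with keypresses masked whenever
    the action's label mentions a sensitive word."""
    cleaned = []
    for action in actions:
        action = dict(action)  # don't mutate the original list
        label = action.get('label', '').lower()
        if action['type'] == 'keypress' and any(w in label for w in sensitive_words):
            action['key'] = '***'
        cleaned.append(action)
    return cleaned

recording = [
    {'type': 'keypress', 'key': 'h', 'timestamp': 1.0, 'label': 'Type username'},
    {'type': 'keypress', 'key': 's', 'timestamp': 2.0, 'label': 'Type password'},
]
print(redact(recording)[1]['key'])  # ***
```

Run this before the LLM step too: the model never needs the real keystrokes, only the labels.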
Recording first, then converting to code, flips the traditional RPA development process on its head. Instead of guessing at coordinates and timing while writing code, you perform the task once naturally, label what you did, and let AI turn it into a script.
The workflow: record your actions, label them, generate a script with an LLM, debug it step-by-step in a notebook, then export the finished automation.
This approach is faster, more accurate, and gives you a reference recording if you need to troubleshoot later. And with Jupyter notebooks, you can replay and debug individual steps without running the entire automation from scratch each time.
Stop writing RPA scripts by hand. Record them, label them, and let AI do the translation.