Error Handling in RPA: Building Scripts That Survive the Chaos

RPA scripts are fragile.

You can write the most elegant automation, test it a hundred times, deploy it confidently to production, and it will still fail in ways you never imagined. A dialog box appears. The network hiccups. The application takes 3 seconds to load instead of 2. A field that's always enabled is suddenly disabled. Chaos reigns.

The difference between an RPA script that's a maintenance nightmare and one that actually works in production isn't avoiding errors; it's handling them gracefully when they inevitably occur.

Here's how to build RPA scripts that survive contact with reality.

1. When in Doubt, Restart Everything

This is the nuclear option, and it's also the most reliable recovery mechanism you have.

The principle: If anything seems off, if any check fails, if you're not in the expected state, close everything and start over from a clean slate.

Why this works: RPA scripts fail because the application gets into an unexpected state. Maybe a window opened in the wrong position. Maybe a dialog you dismissed is still somehow affecting focus. Maybe the database locked a record. Rather than trying to diagnose and fix the specific issue (which you often can't do programmatically), just reset to a known-good state and retry.

    import logging
    import time
    from datetime import datetime

    import pyautogui

    def run_with_reset_on_error(automation_function, max_retries=3):
        """Wrapper that resets everything and retries on any error"""
        for attempt in range(max_retries):
            try:
                # Always start clean
                close_all_application_windows()
                clear_clipboard()
                reset_to_desktop()
                time.sleep(1)

                # Run the actual automation
                result = automation_function()

                # If we got here, success!
                return result

            except Exception as e:
                logging.error(f"Attempt {attempt + 1} failed: {e}")

                # Take screenshot of failure state
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                pyautogui.screenshot(f'error_attempt_{attempt + 1}_{timestamp}.png')

                if attempt < max_retries - 1:
                    logging.info(f"Resetting and retrying... ({max_retries - attempt - 1} attempts remaining)")
                    # Nuclear option: close EVERYTHING
                    close_all_application_windows()
                    kill_application_process()  # Force kill if needed
                    time.sleep(2)
                else:
                    logging.error("All retry attempts exhausted")
                    raise

            finally:
                # Always clean up, even on success
                close_all_application_windows()

When to restart:

Any unexpected dialog appears
Elements aren't where you expect them
Timeouts waiting for windows or controls
Clipboard returns unexpected data
Application becomes unresponsive

Restart strategies by severity:

Soft restart: Close windows, return to home screen

    def soft_restart():
        pyautogui.hotkey('alt', 'f4')  # Close current window
        time.sleep(0.5)
        pyautogui.press('escape')  # Cancel any dialogs
        pyautogui.press('escape')
        navigate_to_home_screen()

Medium restart: Close application, reopen

    def medium_restart():
        close_all_application_windows()
        time.sleep(1)
        launch_application()
        wait_for_application_ready()
        login_if_needed()

Hard restart: Kill process, clear temp files, cold start

    def hard_restart():
        kill_application_process()
        clear_application_temp_files()
        clear_application_cache()
        time.sleep(2)
        launch_application()
        wait_for_application_ready()
        login_if_needed()

Pro tip: Keep track of which restart level you're at. If soft restart fails, try medium. If medium fails, go hard.

    def run_with_escalating_restarts(automation_function):
        restart_strategies = [
            ('soft', soft_restart),
            ('medium', medium_restart),
            ('hard', hard_restart)
        ]

        for restart_name, restart_func in restart_strategies:
            try:
                restart_func()
                result = automation_function()
                return result
            except Exception as e:
                logging.error(f"Failed after {restart_name} restart: {e}")
                continue

        raise Exception("All restart strategies failed")
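To remember the restart level between runs, as the tip above suggests, a small stateful helper is one option. This is a minimal sketch: the `RestartEscalator` class and its method names are hypothetical, and the restart callables are whatever your project defines (for example, soft, medium, and hard restart functions).

```python
import logging


class RestartEscalator:
    """Remember the current restart level and escalate on repeated failure.

    Hypothetical helper: `strategies` is an ordered list of (name, callable)
    pairs, mildest first.
    """

    def __init__(self, strategies):
        if not strategies:
            raise ValueError("at least one restart strategy is required")
        self.strategies = list(strategies)
        self.level = 0  # index of the next strategy to try

    def restart(self):
        """Run the restart for the current level, then escalate for next time."""
        name, func = self.strategies[self.level]
        logging.info("Performing %s restart", name)
        func()
        # Stay at the last (hardest) level once it has been reached
        self.level = min(self.level + 1, len(self.strategies) - 1)
        return name

    def reset(self):
        """Call after a successful run so the next failure starts soft again."""
        self.level = 0
```

Call `restart()` whenever a run fails and `reset()` after a run succeeds, so a long-lived worker only pays for a hard restart when the milder ones have just failed.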

2. Verbose Logging: Your Debugging Lifeline

When your RPA script fails at 3 AM on a production server you can't access, logs are the only window into what happened.

Log everything. And I mean everything.

    import logging
    from datetime import datetime

    # Set up detailed logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(f'rpa_log_{datetime.now().strftime("%Y%m%d")}.log'),
            logging.StreamHandler()  # Also print to console
        ]
    )

    def process_invoice(invoice_id):
        logging.info(f"{'='*50}")
        logging.info(f"Starting invoice processing for ID: {invoice_id}")
        logging.info(f"{'='*50}")

        try:
            # Log before every action
            logging.info("Opening invoice screen")
            open_invoice_screen()
            logging.info("Invoice screen opened successfully")

            logging.info(f"Searching for invoice: {invoice_id}")
            search_invoice(invoice_id)
            logging.info("Search completed")

            # Log state checks
            if not is_invoice_found():
                logging.warning(f"Invoice {invoice_id} not found in system")
                return None
            logging.info("Invoice found, proceeding with extraction")

            # Log data extraction
            logging.info("Extracting invoice data")
            data = extract_invoice_data()
            logging.info(f"Extracted data: {data}")

            # Log validation
            logging.info("Validating extracted data")
            if not validate_invoice_data(data):
                logging.error(f"Data validation failed for invoice {invoice_id}")
                logging.error(f"Invalid data: {data}")
                raise ValueError("Data validation failed")
            logging.info("Data validation passed")

            logging.info(f"Invoice {invoice_id} processed successfully")
            return data

        except Exception as e:
            logging.error(f"Fatal error processing invoice {invoice_id}: {e}")
            logging.error(f"Error type: {type(e).__name__}")
            logging.error(f"Error details: {str(e)}")

            # Log the state at time of failure
            logging.error("Attempting to capture failure state...")
            try:
                active_window = get_active_window_title()
                logging.error(f"Active window at failure: {active_window}")
            except Exception:
                # Catch Exception rather than everything, so Ctrl+C still works
                logging.error("Could not determine active window")

            raise

What to log:

Start and end of every function: "Entering function X", "Function X completed"
Before and after every action: "Clicking save button", "Save button clicked"
All extracted data: "Clipboard contents: [data]", "OCR result: [text]"
State checks: "Verifying window is active", "Confirming field is enabled"
Wait operations: "Waiting for dialog to appear", "Timeout waiting for element"
Decisions: "Validation passed, continuing", "Error detected, triggering restart"
Timing information: "Operation took 2.3 seconds", "Exceeded expected duration"
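The first and last items in the list above (function entry/exit and timing) can be handled once with a decorator instead of hand-written log lines in every function. The sketch below is illustrative; `log_calls` is a hypothetical name, and `add` merely stands in for a real automation step.

```python
import functools
import logging
import time


def log_calls(func):
    """Log entry, exit, duration, and failures of the wrapped function."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logging.info("Entering %s", func.__name__)
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed = time.perf_counter() - start
            # logging.exception also records the traceback
            logging.exception("%s failed after %.2fs", func.__name__, elapsed)
            raise
        elapsed = time.perf_counter() - start
        logging.info("%s completed (took %.2fs)", func.__name__, elapsed)
        return result

    return wrapper


@log_calls
def add(a, b):  # stand-in for a real automation step
    return a + b
```

Because the decorator re-raises, it only observes failures and does not change how they propagate to retry or restart logic further up.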

Log levels for different situations:

    logging.debug("Tab count is 5")  # Detailed debugging info
    logging.info("Processing invoice 12345")  # Normal operation
    logging.warning("Window took 3 seconds to appear (expected 1s)")  # Concerning but not fatal
    logging.error("Failed to find Save button")  # Error that might be recoverable
    logging.critical("Application crashed, unable to restart")  # Fatal error

Context managers for action logging:

    import logging
    import time
    from contextlib import contextmanager

    @contextmanager
    def log_action(action_name):
        """Context manager to log start and end of actions"""
        logging.info(f"Starting: {action_name}")
        start_time = time.time()
        try:
            yield
            duration = time.time() - start_time
            logging.info(f"Completed: {action_name} (took {duration:.2f}s)")
        except Exception as e:
            duration = time.time() - start_time
            logging.error(f"Failed: {action_name} after {duration:.2f}s - {e}")
            raise

    # Usage
    with log_action("Open customer form"):
        open_customer_form()

    with log_action("Enter customer data"):
        enter_customer_name("John Smith")
        enter_customer_email("john@example.com")

Log rotation to prevent disk space issues:

    from logging.handlers import RotatingFileHandler

    # Keep max 10 log files of 10MB each
    handler = RotatingFileHandler(
        'rpa_log.log',
        maxBytes=10*1024*1024,  # 10MB
        backupCount=10
    )
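A rotating handler only takes effect once it is attached to a logger. The sketch below shows one way to wire it into the root logger with the same format string used earlier; the function name `configure_rotating_log` and its parameters are illustrative assumptions, not part of any library.

```python
import logging
from logging.handlers import RotatingFileHandler


def configure_rotating_log(path="rpa_log.log", max_bytes=10 * 1024 * 1024, backups=10):
    """Attach a size-rotated file handler to the root logger and return it."""
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(handler)
    return handler
```

Returning the handler lets the caller remove or close it later, which is handy in tests and in long-running workers that reconfigure logging.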

3. Screen Recording: Your Time Machine for Debugging

Logs tell you what the script thought was happening. Screen recordings show you what actually happened.

Use OBS Studio or similar to record your RPA scripts running. This is especially critical during:

Initial development and testing
Stress testing
First production runs
After any script modifications

Setting up automatic screen recording:

    import logging
    import subprocess
    import time
    from datetime import datetime

    class ScreenRecorder:
        def __init__(self, output_path):
            self.output_path = output_path
            self.process = None

        def start(self):
            """Start screen recording via command line"""
            # This example uses ffmpeg; driving OBS instead requires obs-cli or a similar plugin
            logging.info(f"Starting screen recording: {self.output_path}")
            self.process = subprocess.Popen([
                'ffmpeg',
                '-f', 'gdigrab',  # Windows screen capture
                '-framerate', '10',  # Lower framerate to save space
                '-i', 'desktop',
                '-c:v', 'libx264',
                '-preset', 'ultrafast',
                self.output_path
            ], stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            time.sleep(2)  # Give it time to start

        def stop(self):
            """Stop recording"""
            if self.process:
                logging.info("Stopping screen recording")
                # Sending 'q' on stdin asks ffmpeg to finalize the file and exit.
                # (signal.CTRL_C_EVENT only works for children started with
                # CREATE_NEW_PROCESS_GROUP, so it is avoided here.)
                self.process.communicate(input=b'q')
                self.process = None

    def run_with_recording(automation_function):
        """Run automation with automatic screen recording"""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        recorder = ScreenRecorder(f'rpa_recording_{timestamp}.mp4')
        try:
            recorder.start()
            result = automation_function()
            return result
        except Exception:
            logging.error("Error occurred during recording - video saved for review")
            raise
        finally:
            recorder.stop()

Why screen recording is invaluable:

Invisible issues: Sometimes errors happen so fast you can't see them. Slow down the recording and watch frame-by-frame.

Mystery dialogs: "A dialog appeared but I don't know which one." Watch the recording, pause on the dialog, read it.

Timing problems: "The script clicks too fast." Watch the recording to see exactly what's happening and adjust delays.

Regression testing: Compare recordings before and after changes to see what's different.

Training data: If you're building ML models or using LLMs to generate automation, recordings provide training examples.

Proof for stakeholders: "The automation failed because of a network timeout." Show them the recording of the spinning loader that never finishes.

Screen recording best practices:

Lower framerate (5-10 fps) saves massive disk space without losing useful info
Record only the application window instead of full screen when possible
Auto-delete successful runs after a few days to save space
Keep failure recordings indefinitely for analysis
Add timestamp overlay to recordings to correlate with logs

    import logging
    from datetime import datetime, timedelta
    from pathlib import Path

    def cleanup_old_recordings(days_old=7, keep_errors=True):
        """Delete old recording files to save disk space"""
        cutoff_date = datetime.now() - timedelta(days=days_old)

        for recording in Path('recordings').glob('*.mp4'):
            # Keep error recordings (assumes failed runs have 'error' in the filename)
            if keep_errors and 'error' in recording.name:
                continue

            # Delete old successful recordings
            if recording.stat().st_mtime < cutoff_date.timestamp():
                logging.info(f"Deleting old recording: {recording}")
                recording.unlink()
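Two of the practices above, a lower frame rate and recording a single window, come down to how the ffmpeg command is built. The sketch below assembles the argument list for ffmpeg's `gdigrab` input on Windows, which accepts either `desktop` or `title=<window title>` as its source. The function name and defaults are hypothetical, and it assumes ffmpeg is on the PATH.

```python
def build_ffmpeg_command(output_path, window_title=None, framerate=5):
    """Build an ffmpeg gdigrab command (Windows only) as an argument list.

    If `window_title` is given, only that window is captured; otherwise the
    whole desktop is recorded.
    """
    source = f"title={window_title}" if window_title else "desktop"
    return [
        "ffmpeg",
        "-f", "gdigrab",
        "-framerate", str(framerate),
        "-i", source,
        "-c:v", "libx264",
        "-preset", "ultrafast",
        output_path,
    ]
```

The returned list can be passed straight to `subprocess.Popen`. Keeping it as a list rather than a single string avoids quoting problems with window titles that contain spaces.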

4. Chunk Your Script Into Error-Recoverable Steps

Monolithic RPA scripts are debugging nightmares. Break everything into small, self-contained steps with clear error boundaries.

Bad approach:

    def process_customer_order():
        # 200 lines of code
        # If line 150 fails, good luck figuring out what state you're in
        open_application()
        login()
        search_customer()
        verify_customer()
        open_order_form()
        enter_line_items()
        calculate_totals()
        apply_discount()
        submit_order()
        print_confirmation()
        email_customer()
        close_application()

Good approach:

    class OrderProcessingSteps:
        """Each step is independent and can recover from errors"""

        def __init__(self):
            self.state = {
                'customer_id': None,
                'order_id': None,
                'line_items': [],
                'total': 0
            }

        def step_1_initialize(self):
            """Open application and log in"""
            logging.info("STEP 1: Initialize application")
            try:
                with log_action("Launch application"):
                    open_application()
                with log_action("Login"):
                    login()
                with log_action("Navigate to orders screen"):
                    navigate_to_orders()
                return True
            except Exception as e:
                logging.error(f"Step 1 failed: {e}")
                return False

        def step_2_find_customer(self, customer_id):
            """Search and verify customer"""
            logging.info(f"STEP 2: Find customer {customer_id}")
            try:
                with log_action("Search customer"):
                    search_customer(customer_id)
                with log_action("Verify customer details"):
                    customer_data = extract_customer_data()
                    if not validate_customer_data(customer_data):
                        raise ValueError("Customer validation failed")
                self.state['customer_id'] = customer_id
                return True
            except Exception as e:
                logging.error(f"Step 2 failed: {e}")
                # Can retry from beginning of this step
                return False

        def step_3_create_order(self, line_items):
            """Create new order with line items"""
            logging.info("STEP 3: Create order")
            # Start from an empty list so a retry does not record items twice
            self.state['line_items'] = []
            try:
                with log_action("Open new order form"):
                    open_new_order_form()

                for idx, item in enumerate(line_items):
                    with log_action(f"Add line item {idx+1}/{len(line_items)}"):
                        add_line_item(item)
                    # Save state after each item in case we need to resume
                    self.state['line_items'].append(item)

                with log_action("Calculate totals"):
                    total = calculate_order_total()
                    self.state['total'] = total
                return True
            except Exception as e:
                logging.error(f"Step 3 failed: {e}")
                logging.error(f"Completed {len(self.state['line_items'])} of {len(line_items)} items")
                return False

        def step_4_submit_order(self):
            """Submit and confirm order"""
            logging.info("STEP 4: Submit order")
            try:
                with log_action("Submit order"):
                    submit_order()
                with log_action("Get order confirmation"):
                    order_id = extract_order_id()
                    self.state['order_id'] = order_id
                logging.info(f"Order submitted successfully: {order_id}")
                return True
            except Exception as e:
                logging.error(f"Step 4 failed: {e}")
                return False

    def process_customer_order_with_recovery(customer_id, line_items):
        """Process order with step-by-step recovery"""
        processor = OrderProcessingSteps()

        # Each step can be retried independently
        if not retry_step(processor.step_1_initialize):
            raise Exception("Failed to initialize")

        if not retry_step(lambda: processor.step_2_find_customer(customer_id)):
            raise Exception(f"Failed to find customer {customer_id}")

        if not retry_step(lambda: processor.step_3_create_order(line_items)):
            # We know exactly how many items were added before failure
            logging.error(f"Order creation failed after {len(processor.state['line_items'])} items")
            raise Exception("Failed to create order")

        if not retry_step(processor.step_4_submit_order):
            raise Exception("Failed to submit order")

        return processor.state['order_id']

    def retry_step(step_function, max_attempts=3):
        """Retry a step with restart between attempts"""
        for attempt in range(max_attempts):
            try:
                result = step_function()
                if result:
                    return True
            except Exception as e:
                logging.error(f"Attempt {attempt + 1} failed: {e}")

            if attempt < max_attempts - 1:
                logging.info("Restarting application before retry")
                soft_restart()

        return False

Benefits of chunked steps:

Clear failure points: "Step 3 failed" is infinitely more useful than "Something failed"
Resumable workflows: If Step 3 of 5 fails, you can potentially resume from Step 3
Independent testing: Test each step in isolation
State tracking: Know exactly what was completed before failure
Targeted retries: Only retry the failed step, not the entire workflow
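The "resumable workflows" benefit needs the list of completed steps to outlive the process. One way to do that, sketched below, is a small JSON checkpoint file. The function `run_steps_with_checkpoint` and the checkpoint format are assumptions for illustration; each step is a callable that returns True on success, like the step methods above.

```python
import json
from pathlib import Path


def run_steps_with_checkpoint(steps, checkpoint_path):
    """Run (name, callable) steps in order, skipping ones already completed.

    Completed step names are stored in a JSON file so a later run can resume
    at the first unfinished step.
    """
    path = Path(checkpoint_path)
    completed = json.loads(path.read_text()) if path.exists() else []

    for name, step in steps:
        if name in completed:
            continue  # finished in an earlier run
        if not step():
            return False  # keep the checkpoint for the next run
        completed.append(name)
        path.write_text(json.dumps(completed))

    path.unlink(missing_ok=True)  # workflow done; start fresh next time
    return True
```

Resuming is only safe when the skipped steps left durable results (a saved record, for example). Steps that only set up UI state, such as logging in, usually need to run again after a restart.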

5. Write Assertions: Test Your Assumptions

RPA scripts fail silently. You click a button, nothing happens, and the script merrily continues thinking everything is fine. Until it's not.

Write assertions to verify every assumption.

    import logging
    import time

    import pyautogui
    import pyperclip

    def assert_window_is_active(window_title, timeout=5):
        """Verify expected window is active"""
        start_time = time.time()
        while time.time() - start_time < timeout:
            active = get_active_window_title()
            if window_title.lower() in active.lower():
                logging.info(f"✓ Confirmed active window: {window_title}")
                return True
            time.sleep(0.5)

        # Take screenshot of unexpected state
        actual_window = get_active_window_title()
        pyautogui.screenshot(f'assertion_failed_window_{int(time.time())}.png')
        raise AssertionError(
            f"Expected window '{window_title}' to be active, "
            f"but found '{actual_window}'"
        )

    def assert_field_contains(expected_value, field_name="field"):
        """Verify a field contains expected data"""
        # Extract actual value (via clipboard or OCR)
        pyautogui.hotkey('ctrl', 'a')
        pyautogui.hotkey('ctrl', 'c')
        time.sleep(0.1)
        actual_value = pyperclip.paste()

        if actual_value.strip() != expected_value.strip():
            logging.error(f"✗ Field '{field_name}' assertion failed")
            logging.error(f"  Expected: '{expected_value}'")
            logging.error(f"  Actual: '{actual_value}'")
            pyautogui.screenshot(f'assertion_failed_field_{int(time.time())}.png')
            raise AssertionError(
                f"Field '{field_name}' contains '{actual_value}', "
                f"expected '{expected_value}'"
            )
        logging.info(f"✓ Confirmed field '{field_name}' = '{expected_value}'")

    def assert_element_exists(element_description, check_function):
        """Verify an element exists before proceeding"""
        if not check_function():
            logging.error(f"✗ Element '{element_description}' not found")
            pyautogui.screenshot(f'assertion_failed_element_{int(time.time())}.png')
            raise AssertionError(f"Required element '{element_description}' not found")
        logging.info(f"✓ Confirmed element exists: {element_description}")

Use assertions throughout your script:

    def enter_customer_order():
        # Assert we're on the right screen
        assert_window_is_active("Order Entry")

        # Navigate to customer field
        for _ in range(3):
            pyautogui.press('tab')

        # Enter customer ID
        customer_id = "12345"
        pyautogui.write(customer_id)

        # Assert it was entered correctly
        assert_field_contains(customer_id, "Customer ID")

        # Move to next field
        pyautogui.press('tab')
        time.sleep(0.5)

        # Assert customer name populated (auto-filled by application)
        pyperclip.copy('')
        pyautogui.hotkey('ctrl', 'c')
        time.sleep(0.1)
        customer_name = pyperclip.paste()
        if not customer_name or len(customer_name) < 2:
            raise AssertionError("Customer name did not auto-populate - invalid customer ID?")
        logging.info(f"✓ Customer name auto-populated: {customer_name}")

        # Continue with order...

Common assertions:

Window state: Is the expected window active?
Field values: Does the field contain what we just entered?
Element visibility: Is the button/menu we need present?
Data validation: Is the extracted data in the expected format?
Application responsiveness: Did the application respond within expected time?
Dialog absence: Are there any unexpected popups blocking progress?

    def assert_no_error_dialogs():
        """Check for common error dialog titles"""
        error_titles = ["Error", "Warning", "Failed", "Cannot", "Invalid"]
        active_window = get_active_window_title()

        for error_word in error_titles:
            if error_word.lower() in active_window.lower():
                logging.error(f"✗ Error dialog detected: {active_window}")
                pyautogui.screenshot(f'error_dialog_{int(time.time())}.png')
                raise AssertionError(f"Error dialog appeared: {active_window}")

        logging.debug("✓ No error dialogs detected")
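The "data validation" item in the list of common assertions can be covered with a format check on whatever the clipboard or OCR returned. The helper below is a hypothetical sketch, and the `INV-` pattern in the usage note is an invented example format.

```python
import re


def assert_matches_format(value, pattern, field_name="field"):
    """Raise AssertionError unless `value` fully matches the regex `pattern`.

    Returns the stripped value so the caller can use it directly.
    """
    if value is None or not re.fullmatch(pattern, value.strip()):
        raise AssertionError(
            f"Field '{field_name}' has unexpected format: {value!r} "
            f"(expected pattern {pattern!r})"
        )
    return value.strip()
```

For example, `assert_matches_format(clipboard_text, r"INV-\d{5}", "Invoice number")` would reject an empty clipboard or a stray column header before it gets written into another system.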

6. Build an Error Recovery Decision Tree

Not all errors are equal. Some should trigger immediate restart, others should retry the specific step, and some might be fatal.

    class RPAErrorHandler:
        """Centralized error handling with recovery strategies"""

        def __init__(self):
            self.error_history = []

        def handle_error(self, error, context):
            """
            Decide how to handle an error based on type and context

            Args:
                error: The exception that occurred
                context: Dict with info about where/when error happened

            Returns:
                Action to take: 'retry', 'restart', 'skip', or 'fatal'
            """
            self.error_history.append({
                'error': str(error),
                'type': type(error).__name__,
                'context': context,
                'timestamp': datetime.now()
            })

            error_type = type(error).__name__
            error_message = str(error).lower()

            # Timeout errors - usually safe to retry
            if 'timeout' in error_message or 'timed out' in error_message:
                logging.warning("Timeout detected - will retry")
                return 'retry'

            # Window not found - restart application
            if 'window' in error_message and 'not found' in error_message:
                logging.warning("Window not found - will restart")
                return 'restart'

            # Data validation errors - skip this record
            if isinstance(error, ValueError) or 'validation' in error_message:
                logging.warning("Validation error - will skip")
                return 'skip'

            # Clipboard empty - retry the copy operation
            if 'clipboard' in error_message:
                logging.warning("Clipboard issue - will retry")
                return 'retry'

            # Element not found - restart might help
            if 'not found' in error_message or 'could not find' in error_message:
                logging.warning("Element not found - will restart")
                return 'restart'

            # Network errors - retry after a fixed delay
            if 'network' in error_message or 'connection' in error_message:
                logging.warning("Network error - will retry with delay")
                time.sleep(5)  # Wait before retry
                return 'retry'

            # Repeated same error - escalate
            # (only reached for errors not matched by a rule above)
            recent_errors = [e['error'] for e in self.error_history[-3:]]
            if recent_errors.count(str(error)) >= 3:
                logging.error("Same error repeated 3 times - treating as fatal")
                return 'fatal'

            # Unknown error - default to restart
            logging.warning(f"Unknown error type: {error_type} - will restart")
            return 'restart'

    # Usage
    error_handler = RPAErrorHandler()

    def process_with_smart_error_handling(items):
        """Process items with intelligent error recovery"""
        results = []

        for item in items:
            attempt = 0
            max_attempts = 3

            while attempt < max_attempts:
                try:
                    result = process_single_item(item)
                    results.append({'item': item, 'status': 'success', 'result': result})
                    break
                except Exception as e:
                    attempt += 1
                    action = error_handler.handle_error(e, {
                        'item': item,
                        'attempt': attempt,
                        'function': 'process_single_item'
                    })

                    if action == 'retry':
                        logging.info(f"Retrying item {item} (attempt {attempt})")
                        continue
                    elif action == 'restart':
                        logging.info("Restarting application")
                        soft_restart()
                        continue
                    elif action == 'skip':
                        logging.warning(f"Skipping item {item} due to validation error")
                        results.append({'item': item, 'status': 'skipped', 'error': str(e)})
                        break
                    elif action == 'fatal':
                        logging.critical(f"Fatal error processing item {item}")
                        results.append({'item': item, 'status': 'fatal', 'error': str(e)})
                        raise
            else:
                # Runs only when the loop ends without a break, so an item skipped
                # on its last attempt is not also recorded as failed
                logging.error(f"Exhausted retries for item {item}")
                results.append({'item': item, 'status': 'failed', 'error': 'Max retries exceeded'})

        return results
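For network errors, a delay that grows with each attempt is usually kinder to a struggling server than a fixed five-second sleep. The sketch below shows exponential backoff with a little jitter. `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import logging
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as e:
            if attempt == max_attempts:
                raise  # out of attempts; let the caller decide what to do
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay,
            # plus up to 10% random jitter
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            delay += random.uniform(0, delay * 0.1)
            logging.warning("Attempt %d failed (%s); retrying in %.1fs",
                            attempt, e, delay)
            sleep(delay)
```

Only the exception types listed in `retry_on` are retried; anything else propagates immediately, so a validation error is never mistaken for a flaky connection.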

Putting It All Together

Here's what a production-ready RPA script looks like with all these error handling techniques:

    import json

    def production_rpa_workflow(items_to_process):
        """Production-grade RPA with comprehensive error handling"""
        # Initialize
        # (ItemProcessor and initialize_application are assumed to be defined elsewhere)
        error_handler = RPAErrorHandler()
        recorder = ScreenRecorder(f'workflow_{datetime.now().strftime("%Y%m%d_%H%M%S")}.mp4')
        results = []

        try:
            # Start recording
            recorder.start()

            logging.info(f"{'='*60}")
            logging.info(f"Starting workflow: {len(items_to_process)} items to process")
            logging.info(f"{'='*60}")

            # Step 1: Initialize with retry
            if not retry_step(initialize_application, max_attempts=3):
                raise Exception("Failed to initialize application")

            # Process each item
            for idx, item in enumerate(items_to_process):
                logging.info(f"Processing item {idx+1}/{len(items_to_process)}: {item}")

                attempt = 0
                max_attempts = 3

                while attempt < max_attempts:
                    try:
                        # Checkpoint: verify we're in good state
                        assert_window_is_active("Main Application")
                        assert_no_error_dialogs()

                        # Process item in steps
                        processor = ItemProcessor()

                        with log_action(f"Step 1: Open item {item}"):
                            processor.step_1_open_item(item)

                        with log_action("Step 2: Extract data"):
                            data = processor.step_2_extract_data()

                        # Assert data is valid
                        if not data or len(data) == 0:
                            raise AssertionError("Extracted data is empty")

                        with log_action("Step 3: Process data"):
                            result = processor.step_3_process(data)

                        # Success!
                        results.append({
                            'item': item,
                            'status': 'success',
                            'result': result
                        })
                        logging.info(f"✓ Successfully processed item {item}")
                        break

                    except Exception as e:
                        attempt += 1
                        pyautogui.screenshot(f'error_item_{item}_attempt_{attempt}.png')

                        action = error_handler.handle_error(e, {
                            'item': item,
                            'index': idx,
                            'attempt': attempt
                        })

                        if action == 'retry' and attempt < max_attempts:
                            logging.info(f"Retrying item {item}")
                            continue
                        elif action == 'restart' and attempt < max_attempts:
                            logging.info("Performing restart before retry")
                            soft_restart()
                            initialize_application()
                            continue
                        elif action == 'skip':
                            results.append({
                                'item': item,
                                'status': 'skipped',
                                'error': str(e)
                            })
                            break
                        else:
                            # Fatal errors and exhausted retries both end up here
                            results.append({
                                'item': item,
                                'status': 'failed',
                                'error': str(e),
                                'attempts': attempt
                            })
                            break

            # Summary
            successful = sum(1 for r in results if r['status'] == 'success')
            failed = sum(1 for r in results if r['status'] == 'failed')
            skipped = sum(1 for r in results if r['status'] == 'skipped')

            logging.info(f"{'='*60}")
            logging.info(f"Workflow complete: {successful} succeeded, {failed} failed, {skipped} skipped")
            logging.info(f"{'='*60}")

            return results

        except Exception as e:
            logging.critical(f"Workflow failed catastrophically: {e}")
            raise

        finally:
            # Always clean up
            recorder.stop()
            close_all_application_windows()

            # Save results (default=str keeps non-JSON values from breaking the dump)
            with open(f'results_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json', 'w') as f:
                json.dump(results, f, indent=2, default=str)

Final Thoughts

RPA without error handling is a ticking time bomb. RPA with error handling is a reliable production tool.

The techniques in this post (restarting on errors, verbose logging, screen recording, chunked steps, and assertions) transform fragile scripts into resilient automations that can run unattended and recover from the chaos that real-world applications throw at them.

Build error handling from day one. Your future self will thank you.

Authors

Faizaan Chishtie
