RPA scripts are fragile.
You can write the most elegant automation, test it a hundred times, deploy it confidently to production, and it will still fail in ways you never imagined. A dialog box appears. The network hiccups. The application takes 3 seconds to load instead of 2. A field that's always enabled is suddenly disabled. Chaos reigns.
The difference between an RPA script that's a maintenance nightmare and one that actually works in production isn't avoiding errors; it's handling them gracefully when they inevitably occur.
Here's how to build RPA scripts that survive contact with reality.
This is the nuclear option, and it's also the most reliable recovery mechanism you have.
The principle: If anything seems off, if any check fails, or if you're not in the expected state, close everything and start over from a clean slate.
Why this works: RPA scripts fail because the application gets into an unexpected state. Maybe a window opened in the wrong position. Maybe a dialog you dismissed is still somehow affecting focus. Maybe the database locked a record. Rather than trying to diagnose and fix the specific issue (which you often can't do programmatically), just reset to a known-good state and retry.
import logging
import time
from datetime import datetime

import pyautogui

def run_with_reset_on_error(automation_function, max_retries=3):
    """
    Wrapper that resets everything and retries on any error.
    Helpers such as close_all_application_windows() are application-specific
    and not defined here.
    """
    for attempt in range(max_retries):
        try:
            # Always start clean
            close_all_application_windows()
            clear_clipboard()
            reset_to_desktop()
            time.sleep(1)
            # Run the actual automation
            result = automation_function()
            # If we got here, success!
            return result
        except Exception as e:
            logging.error(f"Attempt {attempt + 1} failed: {e}")
            # Take screenshot of failure state
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            pyautogui.screenshot(f'error_attempt_{attempt + 1}_{timestamp}.png')
            if attempt < max_retries - 1:
                logging.info(f"Resetting and retrying... ({max_retries - attempt - 1} attempts remaining)")
                # Nuclear option: close EVERYTHING
                close_all_application_windows()
                kill_application_process()  # Force kill if needed
                time.sleep(2)
            else:
                logging.error("All retry attempts exhausted")
                raise
        finally:
            # Always clean up, even on success
            close_all_application_windows()
When to restart:
Restart strategies by severity:
Soft restart: Close windows, return to home screen
def soft_restart():
    pyautogui.hotkey('alt', 'f4')  # Close current window
    time.sleep(0.5)
    pyautogui.press('escape')  # Cancel any dialogs
    pyautogui.press('escape')
    navigate_to_home_screen()
Medium restart: Close application, reopen
def medium_restart():
    close_all_application_windows()
    time.sleep(1)
    launch_application()
    wait_for_application_ready()
    login_if_needed()
Hard restart: Kill process, clear temp files, cold start
def hard_restart():
    kill_application_process()
    clear_application_temp_files()
    clear_application_cache()
    time.sleep(2)
    launch_application()
    wait_for_application_ready()
    login_if_needed()
Pro tip: Keep track of which restart level you're at. If soft restart fails, try medium. If medium fails, go hard.
def run_with_escalating_restarts(automation_function):
    restart_strategies = [
        ('soft', soft_restart),
        ('medium', medium_restart),
        ('hard', hard_restart)
    ]
    last_error = None
    for restart_name, restart_func in restart_strategies:
        try:
            restart_func()
            return automation_function()
        except Exception as e:
            logging.error(f"Failed after {restart_name} restart: {e}")
            last_error = e
    # Chain the last failure so the traceback still shows the root cause
    raise Exception("All restart strategies failed") from last_error
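The tip about keeping track of the restart level can also be sketched as a small stateful helper that remembers how far it has escalated between failures. This is only an illustration under assumptions: the `RestartEscalator` class and its method names are hypothetical, and the restart callables are passed in so the example runs without a real application.

```python
import logging
from typing import Callable, List, Tuple


class RestartEscalator:
    """Remembers the current restart level and escalates on repeated failure.

    Hypothetical helper: `strategies` is an ordered list of (name, callable)
    pairs, mildest first (for example soft, medium, hard).
    """

    def __init__(self, strategies: List[Tuple[str, Callable[[], None]]]):
        if not strategies:
            raise ValueError("At least one restart strategy is required")
        self.strategies = strategies
        self.level = 0  # index of the strategy to use on the next failure

    def restart(self) -> str:
        """Run the restart for the current level, then escalate for next time."""
        name, func = self.strategies[self.level]
        logging.info("Performing %s restart", name)
        func()
        # Stay at the harshest level once it has been reached
        self.level = min(self.level + 1, len(self.strategies) - 1)
        return name

    def reset(self) -> None:
        """Call after a success so the next failure starts mild again."""
        self.level = 0


# Usage with stand-in restart functions that only record their name
calls = []
escalator = RestartEscalator([
    ("soft", lambda: calls.append("soft")),
    ("medium", lambda: calls.append("medium")),
    ("hard", lambda: calls.append("hard")),
])
for _ in range(4):
    escalator.restart()  # soft, medium, hard, hard
escalator.reset()
escalator.restart()      # back to soft
```

In a real script you would pass `soft_restart`, `medium_restart`, and `hard_restart` as the callables and call `reset()` after each item that processes cleanly.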
When your RPA script fails at 3 AM on a production server you can't access, logs are the only window into what happened.
Log everything. And I mean everything.
import logging
from datetime import datetime

# Set up detailed logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(f'rpa_log_{datetime.now().strftime("%Y%m%d")}.log'),
        logging.StreamHandler()  # Also print to console
    ]
)

def process_invoice(invoice_id):
    logging.info(f"{'='*50}")
    logging.info(f"Starting invoice processing for ID: {invoice_id}")
    logging.info(f"{'='*50}")
    try:
        # Log before every action
        logging.info("Opening invoice screen")
        open_invoice_screen()
        logging.info("Invoice screen opened successfully")
        logging.info(f"Searching for invoice: {invoice_id}")
        search_invoice(invoice_id)
        logging.info("Search completed")
        # Log state checks
        if not is_invoice_found():
            logging.warning(f"Invoice {invoice_id} not found in system")
            return None
        logging.info("Invoice found, proceeding with extraction")
        # Log data extraction
        logging.info("Extracting invoice data")
        data = extract_invoice_data()
        logging.info(f"Extracted data: {data}")
        # Log validation
        logging.info("Validating extracted data")
        if not validate_invoice_data(data):
            logging.error(f"Data validation failed for invoice {invoice_id}")
            logging.error(f"Invalid data: {data}")
            raise ValueError("Data validation failed")
        logging.info("Data validation passed")
        logging.info(f"Invoice {invoice_id} processed successfully")
        return data
    except Exception as e:
        logging.error(f"Fatal error processing invoice {invoice_id}: {e}")
        logging.error(f"Error type: {type(e).__name__}")
        # Include the full traceback, not just the message
        logging.error("Traceback:", exc_info=True)
        # Log the state at time of failure
        logging.error("Attempting to capture failure state...")
        try:
            active_window = get_active_window_title()
            logging.error(f"Active window at failure: {active_window}")
        except Exception:
            # A bare except would also swallow KeyboardInterrupt and SystemExit
            logging.error("Could not determine active window")
        raise
What to log:
Log levels for different situations:
logging.debug("Tab count is 5") # Detailed debugging info
logging.info("Processing invoice 12345") # Normal operation
logging.warning("Window took 3 seconds to appear (expected 1s)") # Concerning but not fatal
logging.error("Failed to find Save button") # Error that might be recoverable
logging.critical("Application crashed, unable to restart") # Fatal error
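The warning example above ("Window took 3 seconds to appear (expected 1s)") implies a wait that measures itself. One way to get that log line automatically is a polling helper like the sketch below. The name `wait_until` and its parameters are assumptions for illustration, not part of any library, and the condition is a stand-in so the example runs without a GUI.

```python
import logging
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], description: str,
               expected: float = 1.0, timeout: float = 10.0,
               poll_interval: float = 0.1) -> float:
    """Poll `condition` until it returns True and return the elapsed seconds.

    Logs a WARNING when the wait exceeded `expected`, and raises
    TimeoutError when `timeout` passes without the condition holding.
    """
    start = time.monotonic()
    while True:
        if condition():
            elapsed = time.monotonic() - start
            if elapsed > expected:
                logging.warning("%s took %.1fs (expected %.1fs)",
                                description, elapsed, expected)
            else:
                logging.debug("%s ready after %.1fs", description, elapsed)
            return elapsed
        if time.monotonic() - start >= timeout:
            logging.error("%s not ready after %.1fs", description, timeout)
            raise TimeoutError(f"Timeout: {description} not ready after {timeout:.1f}s")
        time.sleep(poll_interval)


# Usage with a stand-in condition that becomes true after about 0.2 seconds
ready_at = time.monotonic() + 0.2
elapsed = wait_until(lambda: time.monotonic() >= ready_at, "Demo window",
                     expected=0.1, timeout=2.0, poll_interval=0.05)
```

In a real script the condition would be something like a check on the active window title. Using `time.monotonic()` keeps the measurement immune to system clock adjustments.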
Context managers for action logging:
import logging
import time
from contextlib import contextmanager

@contextmanager
def log_action(action_name):
    """Context manager to log start and end of actions"""
    logging.info(f"Starting: {action_name}")
    start_time = time.time()
    try:
        yield
        duration = time.time() - start_time
        logging.info(f"Completed: {action_name} (took {duration:.2f}s)")
    except Exception as e:
        duration = time.time() - start_time
        logging.error(f"Failed: {action_name} after {duration:.2f}s - {e}")
        raise

# Usage
with log_action("Open customer form"):
    open_customer_form()
with log_action("Enter customer data"):
    enter_customer_name("John Smith")
    enter_customer_email("john@example.com")
Log rotation to prevent disk space issues:
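import logging
from logging.handlers import RotatingFileHandler

# Keep the current log plus up to 10 rotated backups of 10MB each
handler = RotatingFileHandler(
    'rpa_log.log',
    maxBytes=10*1024*1024,  # 10MB
    backupCount=10
)
handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
# Attach the handler; without this nothing is written to the rotating file
logging.getLogger().addHandler(handler)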
Logs tell you what the script thought was happening. Screen recordings show you what actually happened.
Use OBS Studio or similar to record your RPA scripts running. This is especially critical during:
Setting up automatic screen recording:
import logging
import subprocess
import time
from datetime import datetime

class ScreenRecorder:
    def __init__(self, output_path):
        self.output_path = output_path
        self.process = None

    def start(self):
        """Start recording the desktop with ffmpeg (must be on PATH)"""
        logging.info(f"Starting screen recording: {self.output_path}")
        # This example uses ffmpeg rather than OBS for screen recording
        self.process = subprocess.Popen([
            'ffmpeg',
            '-y',                  # Overwrite an existing file without prompting
            '-f', 'gdigrab',       # Windows screen capture
            '-framerate', '10',    # Lower framerate to save space
            '-i', 'desktop',
            '-c:v', 'libx264',
            '-preset', 'ultrafast',
            self.output_path
        ], stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        time.sleep(2)  # Give it time to start

    def stop(self):
        """Stop recording"""
        if self.process and self.process.poll() is None:
            logging.info("Stopping screen recording")
            try:
                # Sending 'q' on stdin asks ffmpeg to finalize the file and exit.
                # CTRL_C_EVENT is unreliable here: it can also hit the calling script.
                self.process.communicate(input=b'q', timeout=10)
            except subprocess.TimeoutExpired:
                self.process.kill()
                self.process.wait()

def run_with_recording(automation_function):
    """Run automation with automatic screen recording"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    recorder = ScreenRecorder(f'rpa_recording_{timestamp}.mp4')
    try:
        recorder.start()
        result = automation_function()
        return result
    except Exception as e:
        logging.error(f"Error occurred during recording ({e}) - video saved for review")
        raise
    finally:
        recorder.stop()
Why screen recording is invaluable:
Invisible issues: Sometimes errors happen so fast you can't see them. Slow down the recording and watch frame-by-frame.
Mystery dialogs: "A dialog appeared but I don't know which one." Watch the recording, pause on the dialog, read it.
Timing problems: "The script clicks too fast." Watch the recording to see exactly what's happening and adjust delays.
Regression testing: Compare recordings before and after changes to see what's different.
Training data: If you're building ML models or using LLMs to generate automation, recordings provide training examples.
Proof for stakeholders: "The automation failed because of a network timeout." Show them the recording of the spinning loader that never finishes.
Screen recording best practices:
import logging
from datetime import datetime, timedelta
from pathlib import Path

def cleanup_old_recordings(days_old=7, keep_errors=True):
    """Delete old recording files to save disk space"""
    cutoff_date = datetime.now() - timedelta(days=days_old)
    for recording in Path('recordings').glob('*.mp4'):
        # Keep error recordings
        if keep_errors and 'error' in recording.name:
            continue
        # Delete old successful recordings
        if recording.stat().st_mtime < cutoff_date.timestamp():
            logging.info(f"Deleting old recording: {recording}")
            recording.unlink()
Monolithic RPA scripts are debugging nightmares. Break everything into small, self-contained steps with clear error boundaries.
Bad approach:
def process_customer_order():
    # 200 lines of code
    # If line 150 fails, good luck figuring out what state you're in
    open_application()
    login()
    search_customer()
    verify_customer()
    open_order_form()
    enter_line_items()
    calculate_totals()
    apply_discount()
    submit_order()
    print_confirmation()
    email_customer()
    close_application()
Good approach:
class OrderProcessingSteps:
    """Each step is independent and can recover from errors"""

    def __init__(self):
        self.state = {
            'customer_id': None,
            'order_id': None,
            'line_items': [],
            'total': 0
        }

    def step_1_initialize(self):
        """Open application and log in"""
        logging.info("STEP 1: Initialize application")
        try:
            with log_action("Launch application"):
                open_application()
            with log_action("Login"):
                login()
            with log_action("Navigate to orders screen"):
                navigate_to_orders()
            return True
        except Exception as e:
            logging.error(f"Step 1 failed: {e}")
            return False

    def step_2_find_customer(self, customer_id):
        """Search and verify customer"""
        logging.info(f"STEP 2: Find customer {customer_id}")
        try:
            with log_action("Search customer"):
                search_customer(customer_id)
            with log_action("Verify customer details"):
                customer_data = extract_customer_data()
                if not validate_customer_data(customer_data):
                    raise ValueError("Customer validation failed")
            self.state['customer_id'] = customer_id
            return True
        except Exception as e:
            logging.error(f"Step 2 failed: {e}")
            # Can retry from beginning of this step
            return False

    def step_3_create_order(self, line_items):
        """Create new order with line items"""
        logging.info("STEP 3: Create order")
        # A retry opens a fresh order form, so start the item list over
        # instead of appending duplicates from the previous attempt
        self.state['line_items'] = []
        try:
            with log_action("Open new order form"):
                open_new_order_form()
            for idx, item in enumerate(line_items):
                with log_action(f"Add line item {idx+1}/{len(line_items)}"):
                    add_line_item(item)
                    # Record progress after each item so a failure shows how far we got
                    self.state['line_items'].append(item)
            with log_action("Calculate totals"):
                total = calculate_order_total()
                self.state['total'] = total
            return True
        except Exception as e:
            logging.error(f"Step 3 failed: {e}")
            logging.error(f"Completed {len(self.state['line_items'])} of {len(line_items)} items")
            return False

    def step_4_submit_order(self):
        """Submit and confirm order"""
        logging.info("STEP 4: Submit order")
        try:
            with log_action("Submit order"):
                submit_order()
            with log_action("Get order confirmation"):
                order_id = extract_order_id()
                self.state['order_id'] = order_id
            logging.info(f"Order submitted successfully: {order_id}")
            return True
        except Exception as e:
            logging.error(f"Step 4 failed: {e}")
            return False
def process_customer_order_with_recovery(customer_id, line_items):
    """Process order with step-by-step recovery"""
    processor = OrderProcessingSteps()
    # Each step can be retried independently
    if not retry_step(processor.step_1_initialize):
        raise Exception("Failed to initialize")
    if not retry_step(lambda: processor.step_2_find_customer(customer_id)):
        raise Exception(f"Failed to find customer {customer_id}")
    if not retry_step(lambda: processor.step_3_create_order(line_items)):
        # We know exactly how many items were added before failure
        logging.error(f"Order creation failed after {len(processor.state['line_items'])} items")
        raise Exception("Failed to create order")
    if not retry_step(processor.step_4_submit_order):
        raise Exception("Failed to submit order")
    return processor.state['order_id']

def retry_step(step_function, max_attempts=3):
    """Retry a step with restart between attempts"""
    for attempt in range(max_attempts):
        try:
            result = step_function()
            if result:
                return True
        except Exception as e:
            logging.error(f"Attempt {attempt + 1} failed: {e}")
        if attempt < max_attempts - 1:
            logging.info("Restarting application before retry")
            soft_restart()
    return False
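The `state` dictionary above lives only in memory, so it is lost if the script itself crashes. If you want progress to survive a process restart, one option is to write the state to disk after every step. The sketch below is an assumption-laden illustration: `save_checkpoint`, `load_checkpoint`, and `remaining_items` are hypothetical helpers, and resuming only makes sense if the target application also kept the partially entered order.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, path: Path) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh empty state if none exists."""
    if not path.exists():
        return {"customer_id": None, "order_id": None, "line_items": [], "total": 0}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def remaining_items(all_items: list, state: dict) -> list:
    """Items not yet in the checkpoint, assuming items are added in order."""
    return all_items[len(state["line_items"]):]


# Usage: simulate a run that adds two of three items, then "crashes"
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "order_state.json"
    items = ["widget", "gadget", "gizmo"]
    state = load_checkpoint(checkpoint)
    for item in items[:2]:
        state["line_items"].append(item)
        save_checkpoint(state, checkpoint)
    # A new process would start here and pick up where the old one stopped
    resumed = load_checkpoint(checkpoint)
    todo = remaining_items(items, resumed)
```

Writing to a temporary file and renaming it means a reader sees either the old checkpoint or the new one, never a truncated file.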
Benefits of chunked steps:
RPA scripts fail silently. You click a button, nothing happens, and the script merrily continues thinking everything is fine. Until it's not.
Write assertions to verify every assumption.
import logging
import time

import pyautogui
import pyperclip

def assert_window_is_active(window_title, timeout=5):
    """Verify expected window is active"""
    start_time = time.time()
    while time.time() - start_time < timeout:
        active = get_active_window_title()
        if window_title.lower() in active.lower():
            logging.info(f"✓ Confirmed active window: {window_title}")
            return True
        time.sleep(0.5)
    # Take screenshot of unexpected state
    actual_window = get_active_window_title()
    pyautogui.screenshot(f'assertion_failed_window_{int(time.time())}.png')
    raise AssertionError(
        f"Expected window '{window_title}' to be active, "
        f"but found '{actual_window}'"
    )

def assert_field_contains(expected_value, field_name="field"):
    """Verify a field contains expected data"""
    # Clear the clipboard first so a failed copy cannot pass on stale contents
    pyperclip.copy('')
    # Extract actual value (via clipboard or OCR)
    pyautogui.hotkey('ctrl', 'a')
    pyautogui.hotkey('ctrl', 'c')
    time.sleep(0.1)
    actual_value = pyperclip.paste()
    if actual_value.strip() != expected_value.strip():
        logging.error(f"✗ Field '{field_name}' assertion failed")
        logging.error(f"  Expected: '{expected_value}'")
        logging.error(f"  Actual: '{actual_value}'")
        pyautogui.screenshot(f'assertion_failed_field_{int(time.time())}.png')
        raise AssertionError(
            f"Field '{field_name}' contains '{actual_value}', "
            f"expected '{expected_value}'"
        )
    logging.info(f"✓ Confirmed field '{field_name}' = '{expected_value}'")

def assert_element_exists(element_description, check_function):
    """Verify an element exists before proceeding"""
    if not check_function():
        logging.error(f"✗ Element '{element_description}' not found")
        pyautogui.screenshot(f'assertion_failed_element_{int(time.time())}.png')
        raise AssertionError(f"Required element '{element_description}' not found")
    logging.info(f"✓ Confirmed element exists: {element_description}")
Use assertions throughout your script:
def enter_customer_order():
    # Assert we're on the right screen
    assert_window_is_active("Order Entry")
    # Navigate to customer field
    for _ in range(3):
        pyautogui.press('tab')
    # Enter customer ID
    customer_id = "12345"
    pyautogui.write(customer_id)
    # Assert it was entered correctly
    assert_field_contains(customer_id, "Customer ID")
    # Move to next field
    pyautogui.press('tab')
    time.sleep(0.5)
    # Assert customer name populated (auto-filled by application)
    pyperclip.copy('')
    pyautogui.hotkey('ctrl', 'c')
    time.sleep(0.1)
    customer_name = pyperclip.paste()
    if not customer_name or len(customer_name) < 2:
        raise AssertionError("Customer name did not auto-populate - invalid customer ID?")
    logging.info(f"✓ Customer name auto-populated: {customer_name}")
    # Continue with order...
Common assertions:
def assert_no_error_dialogs():
    """Check for common error dialog titles"""
    error_titles = ["Error", "Warning", "Failed", "Cannot", "Invalid"]
    active_window = get_active_window_title()
    for error_word in error_titles:
        if error_word.lower() in active_window.lower():
            logging.error(f"✗ Error dialog detected: {active_window}")
            pyautogui.screenshot(f'error_dialog_{int(time.time())}.png')
            raise AssertionError(f"Error dialog appeared: {active_window}")
    logging.debug("✓ No error dialogs detected")
Not all errors are equal. Some should trigger immediate restart, others should retry the specific step, and some might be fatal.
class RPAErrorHandler:
    """Centralized error handling with recovery strategies"""

    def __init__(self):
        self.error_history = []

    def handle_error(self, error, context):
        """
        Decide how to handle an error based on type and context
        Args:
            error: The exception that occurred
            context: Dict with info about where/when error happened
        Returns:
            Action to take: 'retry', 'restart', 'skip', or 'fatal'
        """
        self.error_history.append({
            'error': str(error),
            'type': type(error).__name__,
            'context': context,
            'timestamp': datetime.now()
        })
        error_type = type(error).__name__
        error_message = str(error).lower()
        # Repeated same error - escalate. This check must come first;
        # placed after the rules below, it would never run for any error they match.
        recent_errors = [e['error'] for e in self.error_history[-3:]]
        if recent_errors.count(str(error)) >= 3:
            logging.error("Same error repeated 3 times - treating as fatal")
            return 'fatal'
        # Timeout errors - usually safe to retry
        if 'timeout' in error_message or 'timed out' in error_message:
            logging.warning("Timeout detected - will retry")
            return 'retry'
        # Window not found - restart application
        if 'window' in error_message and 'not found' in error_message:
            logging.warning("Window not found - will restart")
            return 'restart'
        # Data validation errors - skip this record
        if isinstance(error, ValueError) or 'validation' in error_message:
            logging.warning("Validation error - will skip")
            return 'skip'
        # Clipboard empty - retry the copy operation
        if 'clipboard' in error_message:
            logging.warning("Clipboard issue - will retry")
            return 'retry'
        # Element not found - restart might help
        if 'not found' in error_message or 'could not find' in error_message:
            logging.warning("Element not found - will restart")
            return 'restart'
        # Network errors - retry after a fixed delay
        if 'network' in error_message or 'connection' in error_message:
            logging.warning("Network error - will retry with delay")
            time.sleep(5)  # Wait before retry
            return 'retry'
        # Unknown error - default to restart
        logging.warning(f"Unknown error type: {error_type} - will restart")
        return 'restart'
# Usage
error_handler = RPAErrorHandler()

def process_with_smart_error_handling(items):
    """Process items with intelligent error recovery"""
    results = []
    for item in items:
        attempt = 0
        max_attempts = 3
        while attempt < max_attempts:
            try:
                result = process_single_item(item)
                results.append({'item': item, 'status': 'success', 'result': result})
                break
            except Exception as e:
                attempt += 1
                action = error_handler.handle_error(e, {
                    'item': item,
                    'attempt': attempt,
                    'function': 'process_single_item'
                })
                if action == 'retry':
                    logging.info(f"Retrying item {item} (attempt {attempt})")
                    continue
                elif action == 'restart':
                    logging.info("Restarting application")
                    soft_restart()
                    continue
                elif action == 'skip':
                    logging.warning(f"Skipping item {item} due to validation error")
                    results.append({'item': item, 'status': 'skipped', 'error': str(e)})
                    break
                elif action == 'fatal':
                    logging.critical(f"Fatal error processing item {item}")
                    results.append({'item': item, 'status': 'fatal', 'error': str(e)})
                    raise
        else:
            # Runs only when the loop ended without a break, so an item skipped
            # on its last attempt is not also recorded as failed
            logging.error(f"Exhausted retries for item {item}")
            results.append({'item': item, 'status': 'failed', 'error': 'Max retries exceeded'})
    return results
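The handler above waits a fixed 5 seconds on network errors. If you want true backoff, where each wait is longer than the last, a wrapper along these lines is one option. This is a sketch under assumptions: `retry_with_backoff` is a hypothetical name, it only catches `ConnectionError` and `TimeoutError`, and the `sleep` parameter exists so the example can run without real delays.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(func: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 1.0, max_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `func`, doubling the wait after each network-style failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError) as e:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller decide what to do
            # 1x, 2x, 4x ... the base delay, capped at max_delay
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Up to 10% random jitter so parallel bots do not retry in lockstep
            delay += random.uniform(0, delay * 0.1)
            logging.warning("Attempt %d failed (%s), retrying in %.1fs",
                            attempt, e, delay)
            sleep(delay)
    raise RuntimeError("unreachable")  # Loop always returns or raises


# Usage: a stand-in operation that fails twice, then succeeds
delays = []
attempts = {"count": 0}

def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("connection reset")
    return "ok"

# Record the delays instead of sleeping so the demo finishes instantly
result = retry_with_backoff(flaky_lookup, base_delay=1.0, sleep=delays.append)
```

Capping the delay keeps a long outage from producing waits of several minutes, and re-raising on the final attempt lets the existing error handler classify the failure as usual.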
Here's what a production-ready RPA script looks like with all these error handling techniques:
import json

def production_rpa_workflow(items_to_process):
    """
    Production-grade RPA with comprehensive error handling
    """
    # Initialize
    error_handler = RPAErrorHandler()
    recorder = ScreenRecorder(f'workflow_{datetime.now().strftime("%Y%m%d_%H%M%S")}.mp4')
    results = []
    try:
        # Start recording
        recorder.start()
        logging.info(f"{'='*60}")
        logging.info(f"Starting workflow: {len(items_to_process)} items to process")
        logging.info(f"{'='*60}")
        # Step 1: Initialize with retry
        if not retry_step(initialize_application, max_attempts=3):
            raise Exception("Failed to initialize application")
        # Process each item
        for idx, item in enumerate(items_to_process):
            logging.info(f"Processing item {idx+1}/{len(items_to_process)}: {item}")
            attempt = 0
            max_attempts = 3
            while attempt < max_attempts:
                try:
                    # Checkpoint: verify we're in good state
                    assert_window_is_active("Main Application")
                    assert_no_error_dialogs()
                    # Process item in steps
                    processor = ItemProcessor()
                    with log_action(f"Step 1: Open item {item}"):
                        processor.step_1_open_item(item)
                    with log_action("Step 2: Extract data"):
                        data = processor.step_2_extract_data()
                    # Assert data is valid
                    if not data:
                        raise AssertionError("Extracted data is empty")
                    with log_action("Step 3: Process data"):
                        result = processor.step_3_process(data)
                    # Success!
                    results.append({
                        'item': item,
                        'status': 'success',
                        'result': result
                    })
                    logging.info(f"✓ Successfully processed item {item}")
                    break
                except Exception as e:
                    attempt += 1
                    pyautogui.screenshot(f'error_item_{item}_attempt_{attempt}.png')
                    action = error_handler.handle_error(e, {
                        'item': item,
                        'index': idx,
                        'attempt': attempt
                    })
                    if action == 'retry' and attempt < max_attempts:
                        logging.info(f"Retrying item {item}")
                        continue
                    elif action == 'restart' and attempt < max_attempts:
                        logging.info("Performing restart before retry")
                        soft_restart()
                        initialize_application()
                        continue
                    elif action == 'skip':
                        results.append({
                            'item': item,
                            'status': 'skipped',
                            'error': str(e)
                        })
                        break
                    else:
                        # Fatal errors and exhausted retries both end up here
                        results.append({
                            'item': item,
                            'status': 'failed',
                            'error': str(e),
                            'attempts': attempt
                        })
                        break
        # Summary
        successful = sum(1 for r in results if r['status'] == 'success')
        failed = sum(1 for r in results if r['status'] == 'failed')
        skipped = sum(1 for r in results if r['status'] == 'skipped')
        logging.info(f"{'='*60}")
        logging.info(f"Workflow complete: {successful} succeeded, {failed} failed, {skipped} skipped")
        logging.info(f"{'='*60}")
        return results
    except Exception as e:
        logging.critical(f"Workflow failed catastrophically: {e}")
        raise
    finally:
        # Always clean up
        recorder.stop()
        close_all_application_windows()
        # Save results; default=str keeps non-JSON values from aborting the save
        with open(f'results_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json', 'w') as f:
            json.dump(results, f, indent=2, default=str)
RPA without error handling is a ticking time bomb. RPA with error handling is a reliable production tool.
The techniques in this post (restarting on errors, verbose logging, screen recording, chunked steps, and assertions) transform fragile scripts into resilient automations that can run unattended and recover from the chaos that real-world applications throw at them.
Build error handling from day one. Your future self will thank you.