You're staring at a legacy Windows application. It's been running the business for 15 years. There's critical data inside that needs to be extracted for your integration. And there's absolutely no API.
No REST endpoints. No COM SDK. No ODBC access. The "Export" button? Doesn't exist. Or if it does, it creates some proprietary file format from 1997 that nothing can read.
Welcome to the world of integrating with Windows desktop applications.
Modern SaaS applications are built API-first. Every feature exposed through the UI is also available programmatically. Webhooks notify you of changes. OAuth handles authentication. OpenAPI specs document everything.
Legacy desktop applications? Not so much.
These applications were built in an era when:
Yet these applications still run critical business processes. ERP systems, accounting software, inventory management, manufacturing execution systems: many are desktop applications with decades of business logic and data trapped inside.
When faced with extracting data from desktop applications without APIs, developers typically try:
1. UI Automation / RPA
2. Database Access
3. File System Monitoring
4. Screen Scraping
These approaches work, but they're all workarounds for a fundamental problem: the application wasn't designed to share its data programmatically.
Here's a trick that's saved me countless hours of complex automation: nearly every Windows desktop application, no matter how old, has a Print function.
And if it can print, it can export data.
Printing is a fundamental Windows capability that's been around since the beginning. Developers didn't have to do anything special: Windows provided the printing infrastructure, and applications just plugged into it.
This means:
Windows 10 and later include "Microsoft Print to PDF" as a built-in virtual printer. When you "print" to it, instead of sending output to a physical printer, it creates a PDF file.
This is your secret weapon.
Here's how to use this approach in practice:
Open the application and explore the Print functionality:
Most applications have far more printing capabilities than export capabilities. Accountants and managers from the pre-digital era needed paper reports, so developers built comprehensive printing features.
Use keyboard automation to trigger the print dialog and configure it:
import pyautogui
import time
from pathlib import Path

def print_report_to_pdf(output_path):
    """
    Automate printing a report to PDF.
    Assumes the application is already open to the report screen.
    """
    # Trigger print dialog (works in most Windows apps)
    pyautogui.hotkey('ctrl', 'p')
    time.sleep(1)
    # Select Microsoft Print to PDF printer
    # Usually you can type to search in the printer dropdown
    pyautogui.write('Microsoft Print to PDF')
    time.sleep(0.5)
    # Click Print or press Enter
    pyautogui.press('enter')
    time.sleep(1)
    # Save file dialog appears
    # Type the full path where you want to save
    pyautogui.write(str(output_path))
    time.sleep(0.5)
    # Confirm save
    pyautogui.press('enter')
    time.sleep(2)  # Wait for PDF generation
    # Verify file was created
    if Path(output_path).exists():
        print(f"PDF created successfully: {output_path}")
        return True
    else:
        print("PDF creation failed")
        return False
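The fixed `time.sleep(2)` above is a guess at how long PDF generation takes. A minimal sketch of an alternative, assuming the virtual printer writes the file in place: poll until the file exists and its size has stopped changing. The helper name `wait_for_stable_file` and its default timings are illustrative, not part of any library.

```python
import time
from pathlib import Path


def wait_for_stable_file(path, timeout=30.0, poll_interval=0.5):
    """Return True once `path` exists with a non-zero size that has stopped changing."""
    path = Path(path)
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if path.exists():
            size = path.stat().st_size
            # Two consecutive identical, non-zero sizes suggest the writer is done
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll_interval)
    return False
```

Calling this in place of the final sleep lets short reports finish quickly while giving long reports up to `timeout` seconds.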
Once you have a PDF, you can parse it programmatically:
import PyPDF2
import re
from pathlib import Path

def extract_invoice_data(pdf_path):
    """
    Extract structured data from a printed invoice PDF
    """
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        # Extract text from all pages; the newline keeps the last line of one
        # page from running into the first line of the next
        full_text = ""
        for page in pdf_reader.pages:
            full_text += (page.extract_text() or "") + "\n"
    # Parse the text using patterns
    # This will vary based on your specific report format
    invoice_number = re.search(r'Invoice #:\s*(\d+)', full_text)
    customer_name = re.search(r'Customer:\s*(.+)', full_text)
    total_amount = re.search(r'Total:\s*\$?([\d,]+\.\d{2})', full_text)
    data = {
        'invoice_number': invoice_number.group(1) if invoice_number else None,
        'customer_name': customer_name.group(1).strip() if customer_name else None,
        'total_amount': total_amount.group(1) if total_amount else None
    }
    return data

# Usage
pdf_path = Path("C:/temp/invoice_12345.pdf")
invoice_data = extract_invoice_data(pdf_path)
print(invoice_data)
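Keeping the pattern matching separate from the PDF reading makes the parsing logic easy to test against pasted sample text. The sketch below reuses the three patterns from the example above; the `parse_invoice_text` name, the sample string, and the conversion of the total to `Decimal` are illustrative assumptions.

```python
import re
from decimal import Decimal

INVOICE_PATTERNS = {
    "invoice_number": re.compile(r"Invoice #:\s*(\d+)"),
    "customer_name": re.compile(r"Customer:\s*(.+)"),
    "total_amount": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})"),
}


def parse_invoice_text(text):
    """Apply the field patterns to already-extracted text; missing fields become None."""
    data = {}
    for field, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        data[field] = match.group(1).strip() if match else None
    if data["total_amount"] is not None:
        # "1,234.50" -> Decimal("1234.50"); Decimal avoids float rounding on money
        data["total_amount"] = Decimal(data["total_amount"].replace(",", ""))
    return data


# Made-up sample text standing in for extracted PDF content
sample = "Invoice #: 12345\nCustomer: Acme Tooling Ltd \nTotal: $1,234.50\n"
print(parse_invoice_text(sample))
```

With this split, a change in report layout only requires updating the pattern table and rerunning the parser against a saved text sample.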
Many reports span multiple pages. Your parsing logic needs to handle this:
def extract_customer_list(pdf_path):
    """
    Extract customer data from a multi-page list report
    """
    customers = []
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text = page.extract_text() or ""
            # Header/footer lines on each page don't match the record
            # pattern below, so they are skipped
            lines = text.split('\n')
            for line in lines:
                # Parse each line as a customer record
                # Format depends on your specific report
                match = re.match(r'(\d+)\s+(.+?)\s+([\d-]+)\s+(.+@.+)', line)
                if match:
                    customers.append({
                        'id': match.group(1),
                        'name': match.group(2).strip(),
                        'phone': match.group(3),
                        'email': match.group(4)
                    })
    return customers
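Relying on the record pattern alone to reject headers and footers can let through a footer that happens to look like a record. A minimal sketch of explicit filtering follows; the header and footer shapes in `NOISE_PATTERNS` and the sample pages are hypothetical and would need to match what your report actually prints.

```python
import re

# Hypothetical header/footer shapes; adjust to your report
NOISE_PATTERNS = [
    re.compile(r"^Page \d+ of \d+$"),
    re.compile(r"^Customer List"),
    re.compile(r"^ID\s+Name\s+Phone\s+Email$"),
]
RECORD = re.compile(r"(\d+)\s+(.+?)\s+([\d-]+)\s+(\S+@\S+)")
FIELDS = ("id", "name", "phone", "email")


def parse_customer_pages(page_texts):
    """Parse customer records from per-page text, dropping known header/footer lines."""
    customers = []
    for text in page_texts:
        for raw_line in text.splitlines():
            line = raw_line.strip()
            if not line or any(p.search(line) for p in NOISE_PATTERNS):
                continue
            match = RECORD.match(line)
            if match:
                customers.append(dict(zip(FIELDS, match.groups())))
    return customers


# Made-up page text standing in for page.extract_text() output
pages = [
    "Customer List\nID  Name  Phone  Email\n1001  Jane Doe  555-0100  jane@example.com\nPage 1 of 2",
    "Customer List\nID  Name  Phone  Email\n1002  Bob Ray  555-0101  bob@example.com\nPage 2 of 2",
]
print(parse_customer_pages(pages))
```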
Compared to OCR:
Compared to clipboard scraping:
Compared to UI automation alone:
Compared to database access:
I once had to integrate with a manufacturing company's accounting system from the early 2000s. It had:
But it had comprehensive printing. Every screen had a Print button. Every report could be printed.
The solution:
The whole automation took about 2 seconds per report. It ran nightly, extracting hundreds of invoices. It worked flawlessly for two years until they finally migrated off the legacy system.
Some applications have quirks with Microsoft Print to PDF. Consider installing additional PDF printer drivers:
PDFCreator: Free, open-source PDF printer with command-line options
Adobe PDF: If you have Adobe Acrobat installed
Bullzip PDF Printer: Free with good automation support
You can programmatically set the default printer to ensure consistency:
import win32print

def set_default_printer(printer_name):
    """Set the default Windows printer"""
    win32print.SetDefaultPrinter(printer_name)

# Before your automation runs
set_default_printer("Microsoft Print to PDF")
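Changing the default printer affects every other program on the machine, so it can be worth restoring the previous one when the automation finishes. The sketch below wraps `win32print.GetDefaultPrinter` and `win32print.SetDefaultPrinter` in a context manager. The `temporary_default_printer` name and the `backend` parameter (which allows testing without pywin32) are illustrative assumptions.

```python
from contextlib import contextmanager


@contextmanager
def temporary_default_printer(printer_name, backend=None):
    """Switch the default printer for the duration of a `with` block, then restore it."""
    if backend is None:
        import win32print as backend  # pywin32; Windows only
    previous = backend.GetDefaultPrinter()
    backend.SetDefaultPrinter(printer_name)
    try:
        yield previous
    finally:
        # Restore the original default even if the automation raised
        backend.SetDefaultPrinter(previous)


# Usage on Windows (print_report_to_pdf is the function defined earlier):
# with temporary_default_printer("Microsoft Print to PDF"):
#     print_report_to_pdf(r"C:\temp\report.pdf")
```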
Different applications have different print dialogs. Some tips:
Maximize the window with pyautogui.hotkey('win', 'up') before interacting with it
Add a pause such as time.sleep(1) after opening dialogs
PDF text extraction works best with simple layouts. For complex reports:
Use tabula-py for tables:
import tabula

# Extract tables from PDF (pdf_path is the file produced by the print step)
tables = tabula.read_pdf(pdf_path, pages='all')
for df in tables:
    # Each table is a pandas DataFrame
    print(df.head())
Use pdfplumber for better layout control:
import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Extract with position information
        text = page.extract_text()
        # Or extract tables specifically
        tables = page.extract_tables()
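`page.extract_tables()` returns each table as a list of rows, where each row is a list of cell strings (or `None` for empty cells). A minimal sketch for turning one such table into dictionaries follows, assuming the first row holds the column headers. The `table_to_records` helper and the sample table are illustrative.

```python
def table_to_records(table):
    """Turn one table (a list of rows) into a list of dicts keyed by the header row."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows where every cell is empty
        records.append(dict(zip(header, cells)))
    return records


# Made-up table in the shape extract_tables() produces
sample_table = [
    ["ID", "Name"],
    ["1001", "Jane Doe"],
    [None, None],
    ["1002", " Bob Ray "],
]
print(table_to_records(sample_table))
```

If a report repeats its header row on every page, the header rows from later pages would need to be dropped before merging the records.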
Here's a gotcha that will waste hours of debugging if you don't know about it:
Problem: When Windows opens a PDF, the default handler matters for your automation.
Edge PDF Viewer:
Ctrl+A sometimes doesn't work
Ctrl+C behavior is inconsistent
Chrome PDF Viewer:
Ctrl+A works consistently
Solution: Set Chrome as your default PDF handler:
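# PowerShell script to set Chrome as default PDF handler
# Caution: on Windows 8 and later the UserChoice key is protected by a hash,
# so this direct write is typically rejected or reverted by Windows.
# If it fails, use the Settings app or import a default-app-associations
# XML with DISM instead.
$chromePath = "C:\Program Files\Google\Chrome\Application\chrome.exe"
# Create registry entries to set Chrome as PDF handler
$regPath = "HKCU:\Software\Microsoft\Windows\CurrentVersion\Explorer\FileExts\.pdf\UserChoice"
Set-ItemProperty -Path $regPath -Name "ProgId" -Value "ChromeHTML"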
Or just do it manually:
This matters especially if you're:
This approach isn't perfect. It fails when:
The application doesn't allow printing: Rare, but some security-focused applications disable printing entirely.
Print output is too different from screen data: Some applications format print output in ways that lose important data.
Real-time data is needed: Printing is a batch operation. If you need real-time updates, you'll need webhooks or polling, which print-to-PDF can't provide.
Complex data relationships: PDFs flatten data. If you need to maintain relationships between records (like invoices with line items), parsing PDFs becomes complex.
Binary data: PDFs work for text and simple tables. Images, files, or binary data embedded in the application won't export this way.
In these cases, you'll need to fall back to other approaches like database access, UI automation, or working with the vendor to add proper export functionality.
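For the "complex data relationships" case, parsing is harder but sometimes still workable when the printed layout is regular. The sketch below rebuilds invoice-to-line-item grouping from flattened text by tracking the most recent invoice header. The line formats, the `group_line_items` name, and the sample text are all assumptions made for illustration.

```python
import re
from decimal import Decimal

HEADER = re.compile(r"^Invoice #:\s*(\d+)")
# Assumed item layout: description, quantity, unit price
ITEM = re.compile(r"^(.+?)\s+(\d+)\s+([\d,]+\.\d{2})$")


def group_line_items(text):
    """Attach each item line to the invoice header that most recently preceded it."""
    invoices = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = HEADER.match(line)
        if header:
            current = {"invoice_number": header.group(1), "items": []}
            invoices.append(current)
            continue
        item = ITEM.match(line)
        if item and current is not None:
            current["items"].append({
                "description": item.group(1),
                "quantity": int(item.group(2)),
                "unit_price": Decimal(item.group(3).replace(",", "")),
            })
    return invoices


# Made-up flattened report text
sample = """Invoice #: 1001
  Widget A   2   10.00
  Widget B   1   5.50
Invoice #: 1002
  Gasket   4   1.25
"""
for invoice in group_line_items(sample):
    print(invoice["invoice_number"], len(invoice["items"]))
```

This only holds up while item lines stay on a single line and never span a page break mid-record; reports that wrap descriptions need more careful handling.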
Print-to-PDF works best as part of a multi-pronged approach:
Print-to-PDF for bulk extraction: Get lists of records, summary reports, data tables
Clipboard for individual values: Extract specific fields from detail screens
Keyboard automation for navigation: Move through the application to access different reports
File system monitoring: Watch for the PDFs being created and process them automatically
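A minimal sketch of the file system monitoring piece, using only the standard library: scan an output folder and hand each PDF that hasn't been seen yet to a handler. The `process_new_pdfs` name, the folder path in the usage comment, and the polling design are assumptions, and a dedicated file-watching library could replace the polling.

```python
from pathlib import Path


def process_new_pdfs(watch_dir, handler, seen=None):
    """Run `handler` once on each PDF in `watch_dir` not seen before; return the seen set."""
    seen = set() if seen is None else seen
    for pdf in sorted(Path(watch_dir).glob("*.pdf")):
        if pdf.name in seen:
            continue
        handler(pdf)
        seen.add(pdf.name)
    return seen


# Usage: call repeatedly from a scheduler or a simple loop, e.g.
# seen = set()
# while True:
#     seen = process_new_pdfs(r"C:\temp\reports", extract_invoice_data, seen)
#     time.sleep(5)
```

Because the seen set lives in memory, a restart would reprocess existing files; deleting or moving PDFs after successful processing avoids that.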
Example workflow:
Always verify PDF creation: Check that the file exists and has content before trying to parse it.
Handle print errors gracefully: Sometimes print jobs fail. Log errors and retry.
Clean up temporary files: PDFs accumulate fast. Delete them after successful processing.
Version control your parsing logic: Report formats change. Keep your parsing code in git so you can revert if needed.
Test with multiple report sizes: A 1-page report might parse perfectly, but a 100-page report might time out or run out of memory.
Document report formats: When you figure out how to parse a report, document the format. Future you will thank you.
Set up monitoring: Alert if PDFs stop being created or parsing starts failing consistently.
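The first two practices, verifying the PDF and retrying failed prints, can be sketched together. The check below relies on PDF files beginning with the bytes `%PDF-`. The function names, the `min_bytes` threshold, and the retry counts are illustrative assumptions, and `print_fn` stands for whatever function drives your print dialog.

```python
import logging
import time
from pathlib import Path

log = logging.getLogger("pdf_export")


def is_plausible_pdf(path, min_bytes=100):
    """Cheap sanity check: file exists, is not tiny, and starts with the PDF magic bytes."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size < min_bytes:
        return False
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def export_with_retry(print_fn, output_path, attempts=3, delay=2.0):
    """Call print_fn(output_path) until a plausible PDF appears or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            print_fn(output_path)
        except Exception:
            log.exception("Print attempt %d raised", attempt)
        if is_plausible_pdf(output_path):
            return True
        log.warning("Attempt %d produced no usable PDF at %s", attempt, output_path)
        time.sleep(delay)
    return False
```

The logged warnings also give a monitoring system something to alert on when exports start failing consistently.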
Ready to try this approach? Here's your checklist:
The Print dialog is the most underutilized integration point in Windows desktop applications. While everyone is trying to build complex screen scrapers or reverse-engineer database schemas, there's often a simple solution hiding in plain sight: just print it.
It's not glamorous. It's not the "right" way to build integrations. But it works, it's reliable, and it's often the only option that doesn't require licensing negotiations or vendor cooperation.
Next time you're staring at a legacy desktop application wondering how to get data out, press Ctrl+P and see what happens. You might find your integration just got a lot simpler.