An open source command-line program and Python library for applying optical character recognition (OCR) to PDF documents, generating a searchable PDF with an invisible text layer.
OCRmyPDF is an open source command-line program and Python library that applies optical character recognition (OCR) to PDF documents. It takes an existing PDF as input and generates a new searchable PDF as output with an invisible text layer over images.
OCRmyPDF processes one PDF per invocation, so entire directories of PDFs are handled by calling it from a shell loop or script. It does not detect languages automatically: the OCR language is set with the -l option, and English is assumed by default. It can output PDF/A documents for long-term archiving.
OCRmyPDF aims to be a one-stop solution for making scanned and image-based PDFs fully searchable. It can be used by individuals, organizations, governments and businesses who need to digitize large numbers of existing paper documents. The command-line interface allows automation through scripts and integration into document processing pipelines.
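As a rough illustration of the scripted automation described above, the following Python sketch builds one ocrmypdf command per PDF in a folder and runs them in sequence. The helper names (build_command, plan_batch, run_batch) and the folder names are hypothetical and chosen only for this example. The flags used (--language, --output-type, --skip-text) are existing ocrmypdf options, but check them against the installed version before relying on this.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, languages=("eng",), pdfa=True):
    """Build the ocrmypdf argument list for a single file (hypothetical helper)."""
    # Multiple OCR languages are joined with "+", e.g. "eng+deu".
    cmd = ["ocrmypdf", "--language", "+".join(languages)]
    # "pdfa" requests PDF/A output; "pdf" requests a regular PDF.
    cmd += ["--output-type", "pdfa" if pdfa else "pdf"]
    # Leave pages that already contain text untouched.
    cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def plan_batch(in_dir, out_dir, languages=("eng",)):
    """Return one command per *.pdf file in in_dir, in sorted order."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    return [
        build_command(pdf, out_dir / pdf.name, languages)
        for pdf in sorted(in_dir.glob("*.pdf"))
    ]


def run_batch(in_dir, out_dir, languages=("eng",)):
    """Run ocrmypdf on every PDF in in_dir; requires ocrmypdf on the PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in plan_batch(in_dir, out_dir, languages):
        # check=True raises CalledProcessError if ocrmypdf exits with an error.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "scans" and "searchable" are placeholder folder names.
    run_batch("scans", "searchable", languages=("eng", "deu"))
```

Separating plan_batch from run_batch makes it possible to inspect or log the commands before anything is executed, which can be useful when the script is part of a larger document processing pipeline.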