{"id":29868,"date":"2023-11-20T07:34:17","date_gmt":"2023-11-20T15:34:17","guid":{"rendered":"https:\/\/www.podfeet.com\/blog\/?p=29868"},"modified":"2023-11-20T07:34:17","modified_gmt":"2023-11-20T15:34:17","slug":"ocr-pdf","status":"publish","type":"post","link":"https:\/\/www.podfeet.com\/blog\/2023\/11\/ocr-pdf\/","title":{"rendered":"OCR PDFs with Open Source Tools on Linux by George from Tulsa"},"content":{"rendered":"<p>George from Tulsa here responding to Allison\u2019s request for a show contribution to reduce her load this Thanksgiving week.<\/p>\n<p>Years ago we paid a bank service company to microfilm file cabinets full of irreplaceable paper &#8211; some now 120 years old. The company then scanned the microfilm to image-only PDFs it delivered to us on optical disks.<\/p>\n<p>The working set of PDFs currently resides on our Synology NAS where they&#8217;re in a folder structure organized by an indexed table of contents.<\/p>\n<p>I&#8217;m now engaged in a project to run the gigabytes of image-only PDFs through Optical Character Recognition. This will enable searching for documents across the network by searching for text within documents, searching within open documents, and copying and pasting text and data tables to new documents and spreadsheets.<\/p>\n<p>Since I&#8217;m mostly using Linux, specifically Linux Mint Cinnamon, I&#8217;m going to briefly describe here how that works in Mint and put the more difficult technical &#8220;stuff&#8221;, and all my links, at the bottom of today&#8217;s Shownotes.  I&#8217;ll also talk about Mac options for the same process.<\/p>\n<p>To begin, it&#8217;s necessary to download two applications from the Mint Software Center:<\/p>\n<p>Tesseract is an OCR &#8220;engine&#8221; originally developed by Hewlett-Packard and maintained by Google since 2006. Tesseract is really fast, taking advantage of all 8 cores of my Ryzen 7 Processor. It&#8217;s also surprisingly accurate even on less-than-optimal scans of old paper.  
Many languages are available, but I&#8217;ve only installed English.<\/p>\n<p>Tesseract is not user-facing. It must be invoked by another program.  For us, that&#8217;s &#8220;OCRMYPDF&#8221; which is started by commands in the Linux terminal.<\/p>\n<p>Running terminal commands can be scary. No worry here as all we&#8217;re doing is duplicating the original PDF to a new file with OCR without making any changes to the original. The command is one brief line you&#8217;ll be able to copy and paste from these shownotes where you&#8217;ll also find <a href =\"#steps\">step-by-step instructions<\/a>.<\/p>\n<p>Processing is so fast I&#8217;m using the Linux Application PDF Arranger to merge related PDFs. Think monthly financial statements consolidated into searchable annual documents hundreds of pages long. That works great for what I&#8217;m doing. PDF Arranger will also split long documents into shorter chunks if that works better for you.<\/p>\n<p>What if you want to OCR a file with a lot of text that\u2019s saved as, for example, a JPG?  Simply print the JPG to PDF and you\u2019re good to go.<\/p>\n<p>Okular is a Linux file viewer that has some editing and annotation capabilities. What I find invaluable is its Table Tool which extracts tabular data that can be pasted into spreadsheets for analysis.<\/p>\n<p>One other application to mention. gImageReader, available on Windows and Linux, uses the Tesseract engine for granular OCR and editing of blocks of text.  It does not embed the text within a PDF but saves it as a separate TXT file.  
Down in the Shownotes, there\u2019s a neat video link demonstrating it being used to simultaneously OCR text in Korean and English while the user interactively corrects errors.<\/p>\n<p>It&#8217;s of course possible to OCR digital documents on a Mac.<\/p>\n<p>For a small number of documents, if you have a ScanSnap, which comes with the limited version of ABBYY FineReader, the easiest solution is to print the PDF to paper then re-scan with OCR enabled.  That won&#8217;t work for me because of the gigabytes I need to process and the forest all that printing would kill.<\/p>\n<p>If you&#8217;re geeky and love playing with computers, you might be able to get Tesseract and OCRMYPDF to run on a Mac using MacPorts or Homebrew.<\/p>\n<p>The full Mac versions of ABBYY FineReader, a $69 annual subscription, and Adobe Acrobat PRO, $30 a month or $240 annually, do retroactive OCR. I had the, HA!, perpetual version of Adobe Acrobat PRO 8 and found its OCR results required significant manual correction. Perhaps Acrobat is much better now.  Both offer free trials.<\/p>\n<p>Amazon Software Downloads offers an apparently perpetual license for ABBYY\u2019s 2015 version.  But from reviews, I suspect it isn\u2019t compatible with current versions of macOS.<\/p>\n<p>UPDF Googled up as another Mac and iOS option. Brief research revealed it is a product of the Chinese company Superace and its privacy statement makes clear that if you&#8217;re using its hallmark AI features your content will be uploaded to Superace&#8217;s servers.<\/p>\n<p>Speaking of privacy policies, ABBYY&#8217;s, Adobe&#8217;s, and UPDF&#8217;s are all opaque and confusing, and I&#8217;m a lawyer.  I&#8217;m pretty sure all are at least monitoring when, where, how, and on what computer their software is used. 
Do read and understand their settings, privacy policies, and End User License Agreements, especially if you&#8217;re processing confidential documents.<\/p>\n<p>Privacy is a reason you might want to try a Linux system of your own that can run open source applications which don&#8217;t phone home.<\/p>\n<p>Cost is another reason. There&#8217;s a new generation of nano-sized Linux systems with useful specs that begin as low as $130. Compare that cost to Acrobat or ABBYY. Or the $99 a year virtualization application Parallels that will run Windows and Linux on Macs and, boy, does Parallels phone home.<\/p>\n<p>I&#8217;m wrapping up my audio here, but if you&#8217;re interested in instructions and links, check out this Episode\u2019s Shownotes at Podfeet.com.<\/p>\n<h1><a name=\"steps\">Steps to OCR using OCRMYPDF with Tesseract<\/a><\/h1>\n<p>The OCRMYPDF command in TEXT that can be copied and pasted into Terminal:<\/p>\n<p><code>ocrmypdf --output-type pdf 1.pdf 2.pdf<\/code><\/p>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-01.png\" alt=\"Obtain and install software from Linux Mint software manager - Tesseract and OCRMYPDF\"  title=\"Step 01.png\" width=\"599 \" height=\"341\"><figcaption style=\"text-align:center\">Obtain Tesseract &#038; OCRMYPDF<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-02.png\" alt=\"Install shows gImageReader and says free\"  title=\"Step 02.png\" width=\"599 \" height=\"315\"><figcaption style=\"text-align:center\">Install in One Click<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-03.png\" alt=\"Create a folder called OCR and put a text file in it with the one command for ocrmypdf\"  title=\"Step 
03.png\" width=\"600 \" height=\"411\"><figcaption style=\"text-align:center\">OCR Folder with One-Line OCRMYPDF Command in a Text File<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-04.png\" alt=\"NosillaCast podcast logo as a PDF about to be run through the ocrmypdf command\"  title=\"Step 04.png\" width=\"599 \" height=\"395\"><figcaption style=\"text-align:center\">NosillaCast Logo as a PDF Test<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-05.png\" alt=\"OCR folder before running the command\"  title=\"Step 05.png\" width=\"598 \" height=\"233\"><figcaption style=\"text-align:center\">OCR Folder Before the Command<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-06.png\" alt=\"Right click on folder to choose Open in Terminal\"  title=\"Step 06.png\" width=\"600 \" height=\"461\"><figcaption style=\"text-align:center\">Open Folder in Terminal<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-07.png\" alt=\"run the one-line command\"  title=\"Step 07.png\" width=\"463 \" height=\"449\"><figcaption style=\"text-align:center\">Paste in the One-Line Command<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-08.png\" alt=\"After OCR has run we now see a second pdf file\"  title=\"Step 08.png\" width=\"509 \" height=\"292\"><figcaption style=\"text-align:center\">Folder After Command<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img 
decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-09.png\" alt=\"New PDF with Searchable Text Highlighted. It kind of fails and says Pocdescr instead of Podcast\"  title=\"Step 09.png\" width=\"324 \" height=\"323\"><figcaption style=\"text-align:center\">A Few Mistakes<\/figcaption><\/figure>\n<figure style=\"float: center; margin: 10px\"><img decoding=\"async\" src=\"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-10.png\" alt=\"Confession worst results ever so here it is in gImageReader which also uses Tesseract and it&#39;s perfect\"  title=\"Step 10.png\" width=\"556 \" height=\"478\"><figcaption style=\"text-align:center\">Better Results with gImageReader<\/figcaption><\/figure>\n<hr>\n<p>Privacy Policies:<\/p>\n<ul>\n<li>ABBYY: <a href=\"https:\/\/pdf.abbyy.com\/finereader-ios\/eula\/\">pdf.abbyy.com\/&#8230;<\/a><\/li>\n<li>Adobe: <a href=\"https:\/\/www.adobe.com\/privacy\/policy.html\">www.adobe.com\/&#8230;<\/a><\/li>\n<li>UPDF: <a href=\"https:\/\/updf.com\/privacy-policy\/\">updf.com\/&#8230;<\/a><\/li>\n<li>Parallels: <a href=\"https:\/\/www.alludo.com\/en\/legal\/privacy\/\">www.alludo.com\/&#8230;<\/a><\/li>\n<\/ul>\n<p>ABBYY and Adobe Acrobat Pro Trials:<\/p>\n<ul>\n<li><a href=\"https:\/\/pdf.abbyy.com\/finereader-pdf\/trial\/\">pdf.abbyy.com\/&#8230;<\/a><\/li>\n<li><a href=\"https:\/\/www.adobe.com\/acrobat\/free-trial-download.html\">www.adobe.com\/&#8230;<\/a><\/li>\n<\/ul>\n<p>Linux OCR Software<\/p>\n<ul>\n<li>Tesseract: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Tesseract_(software)\">en.wikipedia.org\/&#8230;<\/a><\/li>\n<li>ocrmypdf:  <a href=\"https:\/\/github.com\/ocrmypdf\/OCRmyPDF\">github.com\/&#8230;<\/a><\/li>\n<\/ul>\n<p>PDFArranger<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/pdfarranger\">github.com\/&#8230;<\/a><\/li>\n<\/ul>\n<p>Okular &#8211; The Universal Document Viewer<\/p>\n<ul>\n<li><a 
href=\"https:\/\/okular.kde.org\/\">okular.kde.org\/&#8230;<\/a><\/li>\n<\/ul>\n<p>gImageReader &#8211; Linux and Windows<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/manisandro\/gImageReader#readme\">github.com\/&#8230;<\/a><\/li>\n<li>YouTube video showing gImageReader in action: <a href=\"https:\/\/www.youtube.com\/watch?v=GMAZtpWQF0U\">www.youtube.com\/&#8230;<\/a><\/li>\n<\/ul>\n<p>Run Linux software on a Mac?<\/p>\n<ul>\n<li><a href=\"https:\/\/www.maketecheasier.com\/ways-run-linux-software-mac\/\">www.maketecheasier.com\/&#8230;<\/a><\/li>\n<\/ul>\n<p>&#8220;<a href=\"https:\/\/www.podfeet.com\/blog\/2011\/01\/295-lastpass-twitter-explained-boxee\/\" target=\"_blank\" rel=\"noopener\">The Vault of Useless Backups<\/a>,&#8221; where in NosillaCast #295 on January 16, 2011 I first discussed the paper I&#8217;m now processing to OCR.  &#8220;If there\u2019s something you absolutely, positively have to keep, paper will outlive computers.&#8221; Subtext: proprietary computer gear and software will let you down when you need it most.<\/p>\n<p>Maybe an Inexpensive NUC is All the Computer You Need<\/p>\n<ul>\n<li><a href=\"https:\/\/www.podfeet.com\/blog\/2018\/05\/gemini-nuc\/\">www.podfeet.com\/&#8230;<\/a><\/li>\n<\/ul>\n<p>Overview of current Mini-PCs:<\/p>\n<ul>\n<li><a href=\"https:\/\/liliputing.com\/tag\/mini-pc\/\">liliputing.com\/&#8230;<\/a><\/li>\n<\/ul>\n<p>The Kamrui AK1 Plus is a dirt cheap mini PC with a 15-watt Intel Processor N95 quad-core chip and list prices starting as low as $180 (although the AK1 Plus is currently on sale for as little as $126).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>George from Tulsa here responding to Allison\u2019s request for a show contribution to reduce her load this Thanksgiving week. Years ago we paid a bank service company to microfilm file cabinets full of irreplaceable paper &#8211; some now 120 years old. 
The company then scanned the microfilm to image-only PDFs it delivered to us on [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":29872,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[147],"tags":[2715,472,1899,410],"class_list":["post-29868","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog-posts","tag-free","tag-ocr","tag-open-source","tag-pdf"],"jetpack_featured_media_url":"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2023\/11\/Step-01-1040x520-1.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/posts\/29868","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/comments?post=29868"}],"version-history":[{"count":3,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/posts\/29868\/revisions"}],"predecessor-version":[{"id":29871,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/posts\/29868\/revisions\/29871"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/media\/29872"}],"wp:attachment":[{"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/media?parent=29868"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/catego
ries?post=29868"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/tags?post=29868"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}