OCRing pdfs with pages that contain both text and images

Question

I have the following Bash script on Ubuntu, which checks whether my PDFs have been OCRed and OCRs them if they haven't. The problem is that some of my PDFs are a mix of OCRed and non-OCRed pages. So I wanted to add a condition to the if statement so that a file is also OCRed when its number of lines or words is below a certain threshold (say 100 lines of text or 1000 words). I am completely new to Ubuntu, and I have added a couple of lines (in bold).

MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
**LINECOUNT=$(wc -l)**
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ] **|| [ "$LINECOUNT" < '100' ]**; then
echo "Not yet OCR'ed: $1 -------- Processing...."
echo " "
ocrmypdf -l eng -s "$1" "$1"
echo " "
else
echo "Already OCR'ed: $1"
echo " "
fi

The script was obtained from here: Batch OCRing PDFs that haven't already been OCR'd

Grubbmeister · Answer 1 · 2019-07-16T04:38:21.787

Because some of my PDFs have pages with both text and scanned images, I first ran the script above to deal with any image-only PDFs. I then modified the script as follows and ran it to clean up the remaining problem PDFs:

LINECOUNT=$(wc -l "$1" | awk '{ print $1 }')
if [ "$LINECOUNT" -lt 500 ]; then
echo "Not yet OCR'ed: $1 -------- Processing...."
echo " "
ocrmypdf --force-ocr -k --oversample 600 "$1" "$1"
echo " "
else
echo "Already OCR'ed: $1"
echo " "
fi

This basically says that if the file has fewer than 500 lines (as counted by wc -l on the PDF file itself), rasterize it and re-OCR it. It's not the ideal solution, but it didn't look like the --skip-text option would work for me:

ocrmypdf --skip-text to skip OCR and other processing on any pages that contain text. Text pages will be copied into the output PDF without modification.

https://ocrmypdf.readthedocs.io/en/latest/errors.html

Though if someone has a better answer, I'd be happy to hear it.
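
One possible refinement, offered only as a sketch: instead of counting lines in the raw PDF file, count the words in its extractable text layer, which is closer to the "fewer than 1000 words" idea from the question. This assumes pdftotext is installed (it ships in poppler-utils, the same package as pdffonts). The function name needs_ocr and the 1000-word default are illustrative and not part of the original script.

```shell
# Sketch: succeed (exit status 0) when a PDF has fewer extractable words
# than a threshold. needs_ocr is a hypothetical helper name.
# $1 = PDF path, $2 = minimum word count (defaults to 1000)
needs_ocr() {
    # "pdftotext FILE -" writes the text layer to stdout; wc -w counts words.
    # tr strips any padding that some wc implementations add around the number.
    # If pdftotext fails, wc sees empty input and the count is 0.
    words=$(pdftotext "$1" - 2>/dev/null | wc -w | tr -d '[:space:]')
    [ "$words" -lt "${2:-1000}" ]
}
```

It could stand in for the line-count test above, for example: if needs_ocr "$1" 1000; then ocrmypdf --force-ocr -k --oversample 600 "$1" "$1"; fi. Note that this still looks at the document as a whole, so a long PDF with many text pages and a few scanned ones may pass the threshold and be skipped.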
