3

I don't really know where to start on this.

I have a Linux server with over 8,000 PDFs and need to know which PDFs have been OCR'd and which ones haven't.

I was thinking of some sort of script calling XPDF to check each PDF, but to be honest I'm not sure whether this is possible.

Thanks in advance for any help

Grimlockz
  • How do you know if a file has been OCR'd? Is there an output file like file1.pdf.ocr? Good luck. – shellter Nov 03 '11 at 16:30
  • [This may help you](http://stackoverflow.com/questions/6026287/batch-ocr-program-for-pdfs) – potong Nov 03 '11 at 17:21
  • So you want to tell the ones that are text from the ones that are images containing text? In that case you could try `pdftotext` and see if it produces any output. – ninjalj Nov 03 '11 at 18:49
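
A minimal sketch of the pdftotext check ninjalj suggests (assuming poppler's or xpdf's pdftotext is installed; -l 5 limits the check to the first five pages):

if [ -z "$(pdftotext -l 5 my-doc.pdf - 2>/dev/null | tr -d '[:space:]')" ]; then
    echo "my-doc.pdf: no text layer found"
fi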

2 Answers

4

The trouble with pdffonts is that sometimes it returns nothing but the header, like this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

And sometimes it returns this:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none]                               Type 3            yes no  no     266  0
[none]                               Type 3            yes no  no       9  0
[none]                               Type 3            yes no  no     297  0
[none]                               Type 3            yes no  no     341  0
[none]                               Type 3            yes no  no     381  0
[none]                               Type 3            yes no  no     394  0
[none]                               Type 3            yes no  no     428  0
[none]                               Type 3            yes no  no     441  0
[none]                               Type 3            yes no  no     451  0
[none]                               Type 3            yes no  no     480  0
[none]                               Type 3            yes no  no     492  0
[none]                               Type 3            yes no  no     510  0
[none]                               Type 3            yes no  no     524  0
[none]                               Type 3            yes no  no     560  0
[none]                               Type 3            yes no  no     573  0
[none]                               Type 3            yes no  no     584  0
[none]                               Type 3            yes no  no     593  0
[none]                               Type 3            yes no  no     601  0
[none]                               Type 3            yes no  no     644  0

With that in mind, let's write a little one-liner to list all the fonts used in a PDF:

pdffonts my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

If your PDF is not OCR'ed, this will output nothing or [none].

If you want it to run faster, use the -l flag to only analyze, say, the first 5 pages:

pdffonts -l 5 my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq

Now wrap it in a bash script, e.g. is-pdf-ocred.sh:

#!/bin/bash
# List the fonts used in the first 5 pages of the PDF given as $1.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
# No fonts at all (or only the "[none]" placeholder) means there is no text layer.
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "NOT OCR'ed: $1"
else
    echo "$1 is OCR'ed."
fi
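
Make the script executable first, then you can test it on a single file:

chmod +x /path/to/my/script/is-pdf-ocred.sh
/path/to/my/script/is-pdf-ocred.sh my-doc.pdf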

Finally, we want to be able to search for PDFs. The find command does not know about your aliases or functions in .bashrc, so we need to give it the full path to the script. Run it in your directory of choice like so:

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \;

I'm assuming that the PDF files end in .pdf, although this is not always an assumption you can make. You will probably want to pipe the output to less or redirect it into a text file:

find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; | less
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; > pdfs.txt

I was able to process about 200 PDFs in a little more than 10 seconds using the -l 5 flag.
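
If you only need the list of files that still have to be OCR'ed, you can filter the saved output on the script's "NOT OCR'ed" prefix, for example:

grep "^NOT OCR'ed:" pdfs.txt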

  • 1
    This approach won't work if the actual PDF is a mixture of text and scanned images. This is pretty common in business. For example, when you digitally sign a scanned PDF, the signing adds a text layer to the PDF, so `pdffonts` will output the signature's font even though it was not an OCR'ed PDF. You can work around it by deleting the known font(s) from the output with `pdffonts scanned.pdf | grep -v -E 'font_name|-|name'`, in case you know the font name(s) that scanned PDFs will use. – Miquel Perez Jul 26 '17 at 11:01
4

Make sure you have the command line tool pdffonts installed. (There are two versions of it: one ships as part of xpdf-utils, the other as part of poppler-utils.)

All PDFs which consist of scanned pages only will not have any fonts used (neither embedded ones, nor un-embedded ones).

The commandline

pdffonts /path/to/scanned.pdf

will then not show any font information for that file.

This may already be enough for you to separate your files into two different sets.
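
One quick way to turn that into a yes/no check is to count the output lines after the two header lines (a minimal sketch, assuming pdffonts is on your PATH):

if [ "$(pdffonts /path/to/scanned.pdf | tail -n +3 | wc -l)" -eq 0 ]; then
    echo "no fonts found - probably a scanned, un-OCR'ed PDF"
fi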

If you have PDFs which contain a mix of scanned pages and "normal" pages (or scanned-and-OCR'ed pages), then you'll have to extend and refine the above simplistic approach. See man pdffonts or pdffonts --help for more info.
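
For that mixed case, one way to refine the check (a rough sketch, assuming poppler's pdfinfo is also installed and using the -f/-l page-range options of pdffonts; the file path is just a placeholder) is to look at each page separately and flag pages that use no fonts:

f=/path/to/mixed.pdf
pages=$(pdfinfo "$f" | awk '/^Pages:/ {print $2}')
for p in $(seq 1 "$pages"); do
    if [ "$(pdffonts -f "$p" -l "$p" "$f" | tail -n +3 | wc -l)" -eq 0 ]; then
        echo "page $p of $f uses no fonts (probably a scanned, un-OCR'ed image)"
    fi
done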

Kurt Pfeifle