The trouble with pdffonts is that sometimes it returns nothing, like this:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
And sometimes it returns this:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none] Type 3 yes no no 266 0
[none] Type 3 yes no no 9 0
[none] Type 3 yes no no 297 0
[none] Type 3 yes no no 341 0
[none] Type 3 yes no no 381 0
[none] Type 3 yes no no 394 0
[none] Type 3 yes no no 428 0
[none] Type 3 yes no no 441 0
[none] Type 3 yes no no 451 0
[none] Type 3 yes no no 480 0
[none] Type 3 yes no no 492 0
[none] Type 3 yes no no 510 0
[none] Type 3 yes no no 524 0
[none] Type 3 yes no no 560 0
[none] Type 3 yes no no 573 0
[none] Type 3 yes no no 584 0
[none] Type 3 yes no no 593 0
[none] Type 3 yes no no 601 0
[none] Type 3 yes no no 644 0
With that in mind, let's write a little text tool to get all the fonts from a pdf:
pdffonts my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq
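To make the role of each stage explicit, here is a minimal sketch of the same pipeline wrapped in a function that reads pdffonts output on standard input. The function name list_font_names is made up for illustration, and sort -u is used as shorthand for sort | uniq.

```shell
#!/bin/bash
# Hypothetical helper (not part of pdffonts): reads pdffonts output on stdin
# and prints each distinct value of the first column (the font name) once.
list_font_names() {
    # tail -n +3    : skip the two header lines (column titles and dashes)
    # cut -d' ' -f1 : keep only the text before the first space
    # sort -u       : sort and drop duplicates, same as sort | uniq
    tail -n +3 | cut -d' ' -f1 | sort -u
}
```

It would be used as: pdffonts my-doc.pdf | list_font_names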
If your pdf is not OCR'ed, this will output nothing or [none].
If you want it to run faster, use the -l flag to only analyze, say, the first 5 pages:
pdffonts -l 5 my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq
Now wrap it in a bash script, e.g. is-pdf-ocred.sh:
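#!/bin/bash
# Collect the distinct font names from the first 5 pages of the PDF given as $1.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
# No fonts at all, or only unnamed ones, means there is no text layer.
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "NOT OCR'ed: $1"
else
    echo "$1 is OCR'ed."
fi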
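If you want to check the decision logic without having any PDFs at hand, one option is to separate it from the pdffonts call. The sketch below does that; the function names has_text_fonts and report_ocr_status are hypothetical, and the test is the same one the script uses (an empty list or a lone [none] counts as not OCR'ed).

```shell
#!/bin/bash
# Hypothetical helpers, not part of pdffonts.

# Succeeds (exit status 0) if the font list in $1 names at least one font.
has_text_fonts() {
    case "$1" in
        '' | '[none]') return 1 ;;  # no fonts, or only unnamed fonts
        *) return 0 ;;
    esac
}

# Prints the same messages as the script for file $1 and font list $2.
report_ocr_status() {
    if has_text_fonts "$2"; then
        echo "$1 is OCR'ed."
    else
        echo "NOT OCR'ed: $1"
    fi
}
```

In the script you would then pass it the pipeline's result, for example: report_ocr_status "$1" "$MYFONTS"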
Finally, we want to be able to search for pdfs. The find command does not know about your aliases or functions in .bashrc, so we need to give it the path to the script.
Make the script executable (chmod +x is-pdf-ocred.sh), then run it in your directory of choice like so:
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \;
I'm assuming that the pdf files end in .pdf, although this is not always an assumption you can make.
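If the extension cannot be trusted, one possible workaround is to look at the first bytes of each file instead, since PDF files normally begin with the signature %PDF-. The sketch below is only an approximation: looks_like_pdf is a made-up name, and a PDF with extra bytes in front of its header would be missed.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the file in $1 starts with "%PDF-".
looks_like_pdf() {
    # head -c 5 reads the first five bytes; tr strips NUL bytes so that
    # binary files do not trigger shell warnings in the command substitution.
    [ "$(head -c 5 -- "$1" 2>/dev/null | tr -d '\0')" = '%PDF-' ]
}
```

You could call it at the top of is-pdf-ocred.sh and exit early when it fails, and then drop the -name "*.pdf" test from the find command.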
You will probably want to pipe it to less or output it into a text file:
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; | less
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; > pdfs.txt
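Since the files that still need OCR are usually the interesting ones, a small filter over that output can list just those. This is a sketch that relies on the exact "NOT OCR'ed: " prefix printed by the script above; the function name list_not_ocred is made up.

```shell
#!/bin/bash
# Hypothetical helper: reads the script's output on stdin and prints only
# the paths of files reported as not OCR'ed, with the prefix removed.
list_not_ocred() {
    # -n suppresses default printing; p prints only lines where the
    # substitution matched, i.e. lines that started with the prefix.
    sed -n "s/^NOT OCR'ed: //p"
}
```

For example: list_not_ocred < pdfs.txt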
I was able to do about 200 pdfs in a little more than 10 seconds using the -l 5 flag.