I am currently analyzing a set of PDF files. I want to know how many of them fall into each of these 3 categories:
- Digitally Created PDF: The text is there (copyable) and it is guaranteed to be correct as it was created directly e.g. from Word
- Image-only PDF: A scanned document
- Searchable PDF: A scanned document, but an OCR engine was used. The OCR engine put text "below" the image so that you can search / copy the content. As OCR is pretty good, this is correct most of the time. But it is not guaranteed to be correct.
It is easy to identify image-only PDFs in my domain because every PDF contains text: if I cannot extract any text, the PDF is image-only. But how do I know whether it is "just" a searchable PDF or a digitally created PDF?
By the way, it is not as simple as just looking at the producer as I have seen scanned documents where the Producer field said "Microsoft Word".
Note: As a human, it is easy. I just zoom in on the text. If I see pixels, it's "just" searchable.
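For reference, the image-only check described above could be sketched roughly as follows. This assumes pypdf is used for text extraction (no library is named in the question), and the function names `is_image_only` and `extract_page_texts` are hypothetical. It only separates image-only PDFs from the rest; it does not distinguish searchable from digitally created PDFs.

```python
from typing import Iterable, List


def is_image_only(page_texts: Iterable[str]) -> bool:
    """Return True if no page yielded any non-whitespace text."""
    return not any(text and text.strip() for text in page_texts)


def extract_page_texts(path: str) -> List[str]:
    """Extract the text of each page (assumption: pypdf is installed)."""
    # Imported lazily so is_image_only() has no third-party dependency.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Pages without a text layer are expected to yield an empty string.
    return [page.extract_text() or "" for page in reader.pages]
```

Usage would be along the lines of `is_image_only(extract_page_texts("some.pdf"))`, where `some.pdf` is a placeholder path.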
Here are 3 example PDF files to test solutions:
- Digitally Created PDF
- Scanned PDF: Well... not really; I used a script to create images and then put them together as a PDF. But that only means that the quality is very good. It should be very similar to a scan.
- Searchable PDF
What I tried/thought about
- Using the creator/producer: I see "Microsoft Word" in scanned documents. Also, this would be tedious.
- Embedded fonts: You can extract embedded fonts. The idea was that a scanned document would not have embedded fonts but just use the default. The idea was wrong, as one can see with the example.
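To make the creator/producer attempt reproducible, here is a minimal sketch of reading those two fields and tallying the producer values across a set of files. It assumes pypdf (not named in the question); `read_creator_producer` and `tally_producers` are hypothetical helper names. As noted above, the result is not a reliable classifier, since scanned documents can report "Microsoft Word".

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def read_creator_producer(path: str) -> Dict[str, Optional[str]]:
    """Read the Creator and Producer fields (assumption: pypdf is installed)."""
    from pypdf import PdfReader  # lazy import; only needed for real files

    info = PdfReader(path).metadata  # can be None if the PDF has no info dictionary
    if info is None:
        return {"creator": None, "producer": None}
    return {"creator": info.creator, "producer": info.producer}


def tally_producers(records: Iterable[Dict[str, Optional[str]]]) -> Counter:
    """Count how often each Producer value occurs; missing values are grouped."""
    return Counter(rec.get("producer") or "<missing>" for rec in records)
```

Tallying the values at least shows which producers occur in the data set before deciding whether any of them is informative.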