
I've recently set up a Linux server to convert text-based PDFs to text using the `pdftotext` command that's part of Xpdf, and to convert image-based PDFs to text using a combination of the `gs` (Ghostscript) and `tesseract` commands.

Both solutions work pretty well when I already know whether a PDF is text-based or image-based. However, in order to automate the process of converting many PDFs to text, I need to be able to tell whether a PDF is text-based or image-based so that I know which set of processes to run on the PDF.

Is there any way in PHP to analyze a PDF and tell whether it's text-based or image-based so that I know whether to use Xpdf or Ghostscript/Tesseract on it?

HartleySan
  • What if there is a combination of both? – cmorrissey Sep 23 '16 at 18:53
  • Does that happen, and if so, would running Xpdf's `pdftotext` on the file be sufficient? Either way, whether there are two or three distinct types of PDFs, I need to be able to differentiate between them so that I know how to process them to get the text out. Thanks. – HartleySan Sep 23 '16 at 18:54
  • I would run both scripts against the PDF and then do a comparison on the output. – cmorrissey Sep 23 '16 at 19:00
  • Yeah, I was kind of worried that that would be the only solution. Xpdf is pretty quick at converting over to text, but the `gs`/Tesseract process is very slow. Maybe I could process everything as text first, and then as a separate process after the fact, check where the text is bad and then image-process it. Any advice on how to detect what is "good" text and what is "bad" text? Thanks. – HartleySan Sep 23 '16 at 19:22
  • You could explode your text into words and then use `pspell_check` to see how many misspellings you have in a given block vs. the number of total words (a rough sketch follows this thread). http://php.net/manual/en/function.pspell-check.php – cmorrissey Sep 23 '16 at 19:35
  • Not a bad idea. Thanks. – HartleySan Sep 24 '16 at 17:41
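A rough PHP sketch of the misspelling-ratio check suggested in the comments; the pspell extension, the English dictionary, the regex split, and the 30% threshold are all assumptions to tune, not part of the original suggestion:

```php
<?php
// Rough sketch of the misspelling-ratio check suggested above.
// Assumes the pspell extension and an English dictionary are
// installed; the regex split and 30% threshold are illustrative.
function looksLikeGibberish(string $text, float $threshold = 0.3): bool
{
    $dict  = pspell_new('en');
    $words = preg_split("/[^A-Za-z']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    if (count($words) === 0) {
        return true;  // no recognizable words at all
    }

    $misspelled = 0;
    foreach ($words as $word) {
        if (!pspell_check($dict, $word)) {
            $misspelled++;
        }
    }

    return ($misspelled / count($words)) > $threshold;
}
```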

2 Answers


I think the answer from Kurt Pfeifle here is superb: use `pdffonts` - which is also part of Xpdf / Poppler - to list which fonts a PDF uses.

If it uses any font, it contains text. If not, it contains only images.
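A minimal PHP sketch of this check, assuming the `pdffonts` binary from Xpdf/Poppler is on the server's PATH; the helper name and paths are illustrative:

```php
<?php
// Minimal sketch: treat a PDF as text-based if pdffonts lists at
// least one font. Assumes the Xpdf/Poppler pdffonts binary is on
// the server's PATH.
function pdfHasText(string $path): bool
{
    $output = [];
    exec('pdffonts ' . escapeshellarg($path) . ' 2>/dev/null', $output);

    // pdffonts prints a two-line header (column names plus a dashed
    // separator); any further lines are fonts, i.e. real text.
    return count($output) > 2;
}

$pdf = '/path/to/document.pdf';   // hypothetical path
if (pdfHasText($pdf)) {
    // text-based: extract with pdftotext
} else {
    // image-based: run the gs/tesseract pipeline
}
```

Counting lines past the two-line header avoids parsing the column layout and is enough for a binary text/no-text decision.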

dankito

Comparing the output and deciding whether the resulting text from an OCR run matches the output from an Xpdf run is a non-trivial task. In the case of PDF text that cannot be OCRed (e.g. very small letters) but that Xpdf can still extract, you will even end up with a lot of unnecessary gibberish.

I would suggest extracting images from the PDFs and OCRing only those, not the complete PDF. This way:

  • You don't have to compare texts [1].
  • Depending on how the images are embedded in the PDF, you might also get better OCR results.
  • You also avoid unnecessarily OCRing text that is already present as clear text.

As you are already using Xpdf, you could use `pdfimages -all` to extract the embedded images (a sketch of the full pipeline follows the footnote below).

[1] This is not 100% correct, as the PDF might be a sandwiched PDF where there is already an OCRed text layer "behind" the image.
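A PHP sketch of this pipeline, assuming `pdftotext`, `pdfimages`, and `tesseract` are all on the PATH; the function name, temp-directory handling, and file naming are illustrative:

```php
<?php
// Sketch of the suggested pipeline: grab the clear text layer with
// pdftotext, extract embedded images with pdfimages -all, and OCR
// only those images. Assumes pdftotext, pdfimages, and tesseract
// are on the PATH; temp handling and naming are illustrative.
function extractAllText(string $pdf): string
{
    $tmp = sys_get_temp_dir() . '/pdfimg_' . uniqid();
    mkdir($tmp);

    // 1. Clear text layer (fast) - pdftotext writes to stdout with "-".
    exec('pdftotext ' . escapeshellarg($pdf) . ' -', $lines);
    $text = implode("\n", $lines);

    // 2. Embedded images, kept in their native formats (-all).
    exec('pdfimages -all ' . escapeshellarg($pdf) . ' '
        . escapeshellarg($tmp . '/img'));

    // 3. OCR each extracted image; tesseract appends ".txt" to the
    //    given output base. (glob() runs once, so the .txt files
    //    created below are not re-processed.)
    foreach (glob($tmp . '/img-*') as $image) {
        exec('tesseract ' . escapeshellarg($image) . ' '
            . escapeshellarg($image) . ' 2>/dev/null');
        $text .= "\n" . file_get_contents($image . '.txt');
    }

    // Cleanup of $tmp is omitted for brevity.
    return $text;
}
```

Note that, per the footnote above, a sandwiched PDF would already yield its OCRed text layer via `pdftotext` in step 1, so the OCR pass in step 3 can duplicate some of that text.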

tobltobs
  • Sorry, but I'm confused about what you are recommending I do. Should I convert all PDFs to images and then OCR them indiscriminately, or are you recommending something else? The OCR process with Tesseract is very slow, so I'd like to avoid OCRing as many PDFs as possible. – HartleySan Sep 26 '16 at 15:46
  • @HartleySan I mean to use a tool to extract embedded images and run the OCR tool only on those. I added some more details to my original answer. – tobltobs Sep 26 '16 at 18:12