Detect if a PDF is created from a scanned document using OCR [pdfbox]

Question

I would like to know if a PDF was created from a scanned document using OCR.

To make the text of the scanned document selectable, I assume the recognized text is written over the page using a transparent color, a special font, or something similar.

I'm using PDFBox. I have looked at the font, the color, and many other properties, but I didn't find anything special.

It depends on how the OCR'ed data is actually embedded. One often sees the text rendering mode "invisible", or the text is simply drawn first and the image is then drawn on top, covering it. – mkl, Jun 12 '14 at 12:32
Instead of adding the resolution to your question text, you should have made it an answer. — mkl, Jun 16 '14 at 07:45

score 2 · Accepted Answer · answered Jun 16 '14 at 09:22

In my case the text rendering mode was set to "Neither fill nor stroke text" (rendering mode 3 in the PDF specification, i.e. invisible text).

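PDFBox code (evaluated inside a text-processing subclass; the constant belongs to the PDFBox 1.x API, and PDFBox 2.x uses the RenderingMode enum instead):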

getGraphicsState().getTextState().getRenderingMode() == PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT
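
As a rough illustration of what this check looks for at the PDF level: the text rendering mode is set by the `Tr` operator, and `3 Tr` selects "neither fill nor stroke". The sketch below is plain Java without any PDFBox dependency. It assumes you already have a page's content stream as decoded (uncompressed) text, and `InvisibleTextScan` is a hypothetical helper name, not part of PDFBox. It is a naive scan that can be fooled by the characters "3 Tr" appearing inside a string literal, so a real parser such as PDFBox is preferable in practice.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a PDFBox class): naive scan for invisible text. */
final class InvisibleTextScan {

    // Matches the operand 3 (optionally zero-padded) followed by the "Tr"
    // operator. The lookbehind rejects operands such as 13 or 1.3, and the
    // lookahead rejects longer tokens that merely start with "Tr".
    private static final Pattern MODE_3_TR =
            Pattern.compile("(?<![\\w.+-])0*3\\s+Tr(?![A-Za-z0-9])");

    /**
     * Returns true if the decoded content stream sets the text rendering
     * mode to 3 ("neither fill nor stroke") at least once.
     * Assumption: the caller has already decompressed the stream.
     */
    static boolean usesInvisibleText(String decodedContentStream) {
        return MODE_3_TR.matcher(decodedContentStream).find();
    }
}
```

For example, `InvisibleTextScan.usesInvisibleText("BT /F1 12 Tf 3 Tr (hello) Tj ET")` returns `true`, while the same stream with `0 Tr` returns `false`. For a PDF with multiple pages, the same check would be applied to each page's content stream in turn.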

Co_42

Can you provide the complete example code for a PDF that contains multiple pages? Thanks in advance! :) – Wojtek May 10 '17 at 09:21

score 0 · Answer 2 · answered Jun 12 '14 at 15:46

In most cases, the original image is still present, and the OCR'd text is invisible underneath.

So one possibility would be to find out whether there is an image covering the whole area that contains text.

Another possibility would be to look at the fonts and make some smart decisions based on them.
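
The first idea above, checking whether an image covers the page, comes down to a rectangle overlap test once you know where the image is drawn. A minimal sketch in plain Java follows. `ScanCoverage`, its method names, and the threshold parameter are hypothetical, and how you obtain the image's position and size from the PDF library is not shown here. All coordinates are assumed to be in the same units, with the page origin at (0, 0).

```java
/** Hypothetical helper: how much of a page does one image cover? */
final class ScanCoverage {

    /** Returns the fraction (0.0 to 1.0) of the page area covered by the image. */
    static double coveredFraction(double pageWidth, double pageHeight,
                                  double imgX, double imgY,
                                  double imgWidth, double imgHeight) {
        if (pageWidth <= 0 || pageHeight <= 0) {
            throw new IllegalArgumentException("page dimensions must be positive");
        }
        // Clip the image rectangle to the page, then measure what is left.
        double overlapW = Math.max(0, Math.min(pageWidth, imgX + imgWidth) - Math.max(0, imgX));
        double overlapH = Math.max(0, Math.min(pageHeight, imgY + imgHeight) - Math.max(0, imgY));
        return (overlapW * overlapH) / (pageWidth * pageHeight);
    }

    /** True if the image covers at least the given fraction of the page. */
    static boolean looksLikeFullPageImage(double pageWidth, double pageHeight,
                                          double imgX, double imgY,
                                          double imgWidth, double imgHeight,
                                          double threshold) {
        return coveredFraction(pageWidth, pageHeight, imgX, imgY, imgWidth, imgHeight) >= threshold;
    }
}
```

A threshold such as 0.9 is only a guess and would need tuning, since scanned pages may be cropped or have margins. A full-page image alone does not prove OCR, so this test is probably best combined with a check for text on the same page.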
