I would like to know if a PDF was created from a scanned document using OCR.

To make the text of the scanned document selectable, I assume the recognized text is written to the page in some hidden way, for example with a transparent color or a special font.

I'm using PDFBox. I looked at the font, the color, and many other properties, but I didn't find anything special.

Co_42
  • It depends on how the OCR'ed data is actually embedded. One often sees the rendering mode "invisible" being used, or the text is simply drawn first and then covered by the image. – mkl Jun 12 '14 at 12:32
  • Instead of adding the resolution to your question text, you should have made it an answer. – mkl Jun 16 '14 at 07:45
  • I changed it to an answer – Co_42 Jun 16 '14 at 09:22

2 Answers

In my case the text rendering mode was set to "Neither fill nor stroke text".

PDFBox code (called from inside a PDFStreamEngine subclass while the page content is being processed):

getGraphicsState().getTextState().getRenderingMode() == PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT
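
For anyone who wants to see the idea without a PDFBox setup: in the content stream, the rendering mode is set with the `Tr` operator, and mode 3 is "neither fill nor stroke". Below is a rough sketch that looks for `3 Tr` in page content streams, one string per page. It assumes the streams have already been decoded to plain text by some PDF library. The class and method names are made up for illustration and are not PDFBox API. The whitespace tokenizer is naive: it can be fooled by "Tr" inside a string literal and it does not look into form XObjects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of PDFBox. It scans content streams that
// are assumed to be already decoded (uncompressed) text.
class InvisibleTextScanner {

    // Text rendering mode 3: neither fill nor stroke (invisible text).
    static final int MODE_INVISIBLE = 3;

    /** Returns every mode set by a "Tr" operator, in order of appearance. */
    static List<Integer> renderingModes(String contentStream) {
        List<Integer> modes = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr")) {
                try {
                    // The operand of Tr is the token right before the operator.
                    modes.add(Integer.parseInt(tokens[i - 1]));
                } catch (NumberFormatException e) {
                    // Operand is not a plain integer: ignore this occurrence.
                }
            }
        }
        return modes;
    }

    /** True if the stream switches to the invisible rendering mode at least once. */
    static boolean hasInvisibleText(String contentStream) {
        return renderingModes(contentStream).contains(MODE_INVISIBLE);
    }

    /** Returns the 1-based numbers of the pages whose stream uses invisible text. */
    static List<Integer> pagesWithInvisibleText(List<String> pageStreams) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < pageStreams.size(); i++) {
            if (hasInvisibleText(pageStreams.get(i))) {
                pages.add(i + 1);
            }
        }
        return pages;
    }
}
```

Finding `3 Tr` is only a hint that an OCR layer is present. As the comment on the question says, some tools draw visible text and then cover it with the image, and this check will not catch that.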
Co_42
  • Can you provide the complete example code for a PDF that contains multiple pages? Thanks in advance! :) – Wojtek May 10 '17 at 09:21

In most cases, the original image is still present, and the OCR'd text is invisible underneath it.

So, one possibility would be to find out whether there is an image covering the whole area that contains the text.
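
A rough sketch of that coverage check, in plain Java without any PDF library: an image is painted into the unit square transformed by the current matrix `[a b c d e f]`, so for an upright image (b = c = 0) it occupies the rectangle from (e, f) to (e + a, f + d). The code computes how much of the page that rectangle covers. Everything here is an assumption for illustration: the class and method names are made up, the page origin is taken to be (0, 0), rotation and skew are ignored, and the 0.9 threshold in the usage note is arbitrary.

```java
// Hypothetical helper, not tied to any PDF library.
class ImageCoverage {

    /**
     * Fraction of the page covered by an image drawn with the matrix
     * [a 0 0 d e f], i.e. scaled and translated only. The page is assumed
     * to span (0, 0) to (pageWidth, pageHeight).
     */
    static double coveredFraction(double a, double d, double e, double f,
                                  double pageWidth, double pageHeight) {
        // Image rectangle; min/max handles negative scale factors (flipped images).
        double x0 = Math.min(e, e + a);
        double x1 = Math.max(e, e + a);
        double y0 = Math.min(f, f + d);
        double y1 = Math.max(f, f + d);

        // Clip the rectangle to the page, then compare the areas.
        double w = Math.max(0, Math.min(x1, pageWidth) - Math.max(x0, 0));
        double h = Math.max(0, Math.min(y1, pageHeight) - Math.max(y0, 0));
        return (w * h) / (pageWidth * pageHeight);
    }

    /** True if the image covers at least the given fraction of the page. */
    static boolean looksLikeFullPageScan(double a, double d, double e, double f,
                                         double pageWidth, double pageHeight,
                                         double threshold) {
        return coveredFraction(a, d, e, f, pageWidth, pageHeight) >= threshold;
    }
}
```

For a US Letter page (612 x 792 points), `looksLikeFullPageScan(612, 792, 0, 0, 612, 792, 0.9)` would flag an image placed with `612 0 0 792 0 0 cm`. The matrix values themselves would have to come from whatever library is used to walk the page content.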

Another possibility would be to look at the fonts and make some smart decisions based on them.

Max Wyss