I use the PDFBox library to read all text with its position in an overridden processTextPosition() method. Is it possible to determine which text is invisible?
-
Invisible why? Is it drawn in mode 'invisible' or maybe as clip path without anything to clip? Is it drawn using the same color as the background? Is anything drawn above it? Is it outside the current cropping area (either crop box or clip path)? – mkl May 01 '14 at 20:37
-
I'd like to find out whether the text in the PDF was automatically generated by OCR after scanning or not. This text is usually hidden, but I don't know if it's possible to determine this in PDFBox. – Mayo May 01 '14 at 21:10
-
OCR'ed PDFs usually either cover the text with the scanned image or draw it in the invisible rendering mode (behind or in front of the scanned image). For recognizing the rendering mode with PDFBox, cf. [this answer](http://stackoverflow.com/a/20924898/1729265); for filtering text hidden by images, cf. [this other answer](http://stackoverflow.com/a/20179928/1729265). – mkl May 02 '14 at 07:04
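
As a rough illustration of the rendering-mode check mentioned in the comment above, here is a minimal sketch that does not depend on PDFBox. It only classifies the numeric text rendering mode set by the PDF `Tr` operator, where 3 means "neither fill nor stroke" (invisible) and 7 means "add to the clipping path only". The class and method names are hypothetical. How the mode is obtained is an assumption: in PDFBox 1.8 it can probably be read inside processTextPosition() via getGraphicsState().getTextState().getRenderingMode(), but check the linked answer for the approach that actually works.

```java
/** Hypothetical helper: classifies PDF text rendering modes (Tr operator, values 0 to 7). */
final class TextRenderingModes {
    /** Mode 3: neither fill nor stroke, i.e. invisible text (typical for OCR layers). */
    static final int INVISIBLE = 3;
    /** Mode 7: text is only added to the clipping path and paints nothing itself. */
    static final int CLIP_ONLY = 7;

    private TextRenderingModes() {}

    /** Returns true if text drawn in this mode leaves no visible marks of its own. */
    static boolean paintsNothing(int mode) {
        if (mode < 0 || mode > 7) {
            throw new IllegalArgumentException("Not a valid text rendering mode: " + mode);
        }
        return mode == INVISIBLE || mode == CLIP_ONLY;
    }
}
```

Text for which `TextRenderingModes.paintsNothing(mode)` returns true is a candidate for an OCR layer. This check alone does not catch text that is painted normally and then covered by the scanned image, which is the case the second linked answer deals with.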