
I have a PDF with a watermark in its background. When I scan the page to highlight a word that has watermark or annotation text behind it, the background text gets selected instead, because it is found first in the touch area.

I am using CGPDFScanner to scan the text.

My question is: how can I detect whether scanned text is background text or real text in the PDF? How do I differentiate between standard text and annotation text?

Thanks.

Swaroop
  • Unfortunately I cannot download your PDF; I press the button on the page of the file sharing service but the page merely refreshes. That said, in general you have no chance to differentiate between "background" and "real" text. In case of *tagged* PDFs you might have a chance: the watermark may be tagged as artifact data. – mkl Jun 19 '15 at 13:58
  • @mkl: please turn your comment into a real answer to get my upvote. :-) – Kurt Pfeifle Jun 19 '15 at 21:33
  • @mkl Sorry, I'll share the file again. – Swaroop Jun 20 '15 at 06:48
  • Here is the link to the file: http://www.filedropper.com/pdf_8 – Swaroop Jun 20 '15 at 06:54

1 Answer


In general you have no chance to reliably differentiate between "background" and "real" text. Text is drawn somewhere on the page in some order, and what is foreground, background, normal text, ..., is a matter of human perception and may not at all be reflected in the structure of the PDF content stream.

You can try some educated guesswork, e.g. assuming that "real" text is in strong colors while background text is in lighter colors, or "real" text is arranged in horizontal lines while background text is often more diagonal, etc. But this is guesswork after all, nothing to rely on for sure.

On the other hand, in case of tagged PDFs you might have a chance: the watermark may be tagged as artifact data.

PS: I just saw you shared your file again. For your document the heuristics I mentioned would work: the background text is greyish and printed diagonally.

Thus, while scanning you have to keep track of the fill color and/or the transformation matrices. As soon as the scanner finds text, you know whether it is background or foreground based on the current color and/or matrix value.
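
A minimal sketch of that bookkeeping, written in plain Python rather than against CGPDFScanner so that it runs on its own. The class name `WatermarkGuesser`, both thresholds, and the "light or diagonal means background" rule are assumptions made for illustration, not part of any library. In a CGPDFScanner-based reader the same state updates would presumably live in the callbacks registered for the PDF operators `q`, `Q`, `cm`, `g`, `rg`, `BT`, `Tm`, `Tj` and `TJ`.

```python
import math
from dataclasses import dataclass, replace

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def concat(m, n):
    """Return m x n for PDF matrices written as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)


def diagonal_deviation(angle_deg):
    """Distance in degrees from the nearest multiple of 90 (0 = axis-aligned)."""
    return abs(((angle_deg + 45.0) % 90.0) - 45.0)


@dataclass
class GraphicsState:
    ctm: tuple = IDENTITY
    fill_rgb: tuple = (0.0, 0.0, 0.0)  # the default PDF fill color is black


class WatermarkGuesser:
    """Hypothetical helper: tracks fill color and matrices, guesses per text run."""

    def __init__(self, min_luminance=0.6, max_deviation_deg=5.0):
        self.min_luminance = min_luminance          # assumed threshold
        self.max_deviation_deg = max_deviation_deg  # assumed threshold
        self.state = GraphicsState()
        self.stack = []
        self.text_matrix = IDENTITY

    def handle(self, op, operands=()):
        """Feed one operator; returns a label for text-showing operators."""
        if op == "q":                      # save graphics state
            self.stack.append(replace(self.state))
        elif op == "Q" and self.stack:     # restore graphics state
            self.state = self.stack.pop()
        elif op == "cm":                   # concatenate onto the CTM
            self.state.ctm = concat(tuple(operands), self.state.ctm)
        elif op == "g":                    # gray fill color
            self.state.fill_rgb = (operands[0],) * 3
        elif op == "rg":                   # RGB fill color
            self.state.fill_rgb = tuple(operands)
        elif op == "BT":                   # a text object starts with identity Tm
            self.text_matrix = IDENTITY
        elif op == "Tm":                   # set the text matrix
            self.text_matrix = tuple(operands)
        elif op in ("Tj", "TJ"):           # text is being shown: classify it
            return self.classify()
        return None                        # other operators are ignored here

    def classify(self):
        r, g, b = self.state.fill_rgb
        luminance = 0.299 * r + 0.587 * g + 0.114 * b
        a, b_, *_ = concat(self.text_matrix, self.state.ctm)
        angle = math.degrees(math.atan2(b_, a))
        is_light = luminance >= self.min_luminance
        is_diagonal = diagonal_deviation(angle) > self.max_deviation_deg
        return "background" if (is_light or is_diagonal) else "foreground"


def classify_stream(ops):
    """Run a list of (operator, operands) pairs and collect the text labels."""
    guesser = WatermarkGuesser()
    labels = (guesser.handle(op, args) for op, args in ops)
    return [label for label in labels if label is not None]


# Made-up content stream: light grey text rotated 45 degrees, then plain black text.
DEMO_OPS = [
    ("q", ()), ("g", (0.8,)),
    ("cm", (0.7071, 0.7071, -0.7071, 0.7071, 100, 100)),
    ("BT", ()), ("Tj", ("DRAFT",)), ("Q", ()),
    ("BT", ()), ("Tm", (1, 0, 0, 1, 72, 700)), ("Tj", ("Hello",)),
]
print(classify_stream(DEMO_OPS))
```

The sketch leaves out a lot on purpose: other color operators (`k`, `sc`, `scn`), text positioning operators such as `Td`, font size, and form XObjects would all need handling in a real reader.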

Be aware, though, it is not that easy with all documents.

mkl
  • Thanks for the reply :). I thought of another heuristic based on the height and width of the watermark text. That can be one of the heuristics, right? While scanning, the rectangle I get for that text takes up 3/4 of the page, so I can decide to skip it on that basis as well, right? Or might that go wrong? – Swaroop Jun 20 '15 at 09:36
  • That is another heuristic rule, too. But be aware, heuristics are guesswork after all and will fail every once in a while. – mkl Jun 20 '15 at 09:51
  • Okay.. Thanks a lot for the help. :) – Swaroop Jun 20 '15 at 10:56
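
The size rule discussed in these comments could be sketched roughly as follows. The function names, the `(x, y, width, height)` rectangle layout and the 0.5 cut-off are assumptions chosen for illustration; the comment only reports that the watermark rectangle covered about 3/4 of the page in this one document.

```python
def area_fraction(text_rect, page_rect):
    """Fraction of the page area covered by a text rectangle.

    Both rectangles are (x, y, width, height) tuples in the same units.
    """
    _, _, text_w, text_h = text_rect
    _, _, page_w, page_h = page_rect
    page_area = page_w * page_h
    if page_area <= 0:
        raise ValueError("page rectangle must have a positive area")
    return abs(text_w * text_h) / page_area


def looks_like_watermark(text_rect, page_rect, max_fraction=0.5):
    """Guess: a single text run covering this much of the page is background."""
    return area_fraction(text_rect, page_rect) >= max_fraction


# Example values are made up: an A4-sized page in points, one huge run, one body line.
PAGE = (0, 0, 595, 842)
print(looks_like_watermark((50, 100, 500, 640), PAGE))
print(looks_like_watermark((72, 700, 450, 14), PAGE))
```

As with the other rules, this can misfire, for example on a page consisting of one very large title.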