1

File example: file.

Problem - when extracting text using PdfTextStripper, there is token "9/1/2017" and "387986" after "ASSETS" in the page start which should be removed, and some others hidden tokens.

I have already applied this solution (so I do not copy-paste it here, because actually problem is exactly the same) and still that hidden text is appearing on page. Could it be hidden by something else except clip path? thanks!

1 Answers1

1

Could it be hidden by something else except clip path?

Yes. In case of your new document the text is written in white on white, e.g. the 387986 after ASSETS is drawn like this:

1 1 1 rg
/TT0 16 Tf
-1011.938 115.993 Td
(@A,BAC)Tj 

The initial 1 1 1 rg sets the fill color to RGB WHITE. (Additionally that text is quite tiny but would still be visible if drawn in e.g. BLACK.)

The solution you refer to was implemented for documents like the sample document presented in that issue in which the invisible text is made invisible by defining clip paths (outside the bounds of which the text is) and by filling paths (hiding the text underneath). Thus, your white text won't be recognized by it as hidden.

Unfortunately recognizing invisibility of WHITE on WHITE text is more difficult to determine than that of clipped or covered text because one not only needs to know the a property of the current graphics state (like the clip path) or remove all text inside a given path, one also needs to know the color of the part of the page right before the text is drawn (to check the on WHITE detail).

If, on the other hand, you assume the page background to be essentially WHITE, it is fairly simple to ignore all white text: Simply also detect the current fill color in processTextPosition:

PDColor fillColor = gs.getNonStrokingColor();

and compare it to the flavors of WHITE you want to consider invisible. (Usually it should suffice to compare with RGB, CMYK, and Grayscale WHITE; in seldom cases you'll also have to correctly interpret more complex color spaces. Additionally you might also consider nearly WHITE colors invisible, (.99, .99, .99) RGB can hardly be distinguished from WHITE.)

If you find the current color to be WHITE, ignore the current TextPosition.

Be aware, though, just like the solution you referenced this is not yet the final solution recognizing all WHITE text: For that you'll also have to check the text rendering mode: If it is just filling (the default), the above holds, but if it is (also) stroking, you'll (also) have to consider the stroking color; if it is rendered invisible, there is no color to consider; and if the text rendering mode includes adding to path for clipping, you'll have to wait and determine what will be later drawn in this part of the page as long as the clip path holds, definitely not trivial!

mkl
  • 90,588
  • 15
  • 125
  • 265
  • 1
    Thank you so much for so deep explanation. This is definitely the case. Just one think, I needed to add all those "Stroke" operators to my custom text stripper (addOperator(new SetStrokingColor()) etc) to process color instuctions. – D.F. Stones Feb 16 '18 at 19:29
  • 1
    indeed, the text stripper as is only registers the operator listeners required for simple text extraction. – mkl Feb 16 '18 at 20:19