
File example: test

In the 2nd row of the table, after "3500 RENT", PDFTextStripper returns two text tokens ("1", "1") that are not visible in the original PDF. I know this can be caused by a clip path (like in the post here) or by a color issue (like in the post here).

However, it looks like in this case it's hidden by some other means... the clip path does not overlap and the color is black for those tokens.

What else could it be?

mkl

1 Answer


It is a color issue, the '1's are printed in white.

What makes the situation a bit special is that the ColorSpace in use is not your off-the-shelf DeviceRGB or DeviceGray but a Separation color space, and color values in Separation color spaces are always treated as subtractive colors. Thus, a tint value of 0.0 denotes the lightest color that can be achieved with the given colorant, and 1.0 is the darkest. This convention is the same as for DeviceCMYK color components but opposite to the one for DeviceGray and DeviceRGB.

(cf. ISO 32000-1 section 8.6.6.4 "Separation Colour Spaces")
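To make the convention above concrete, here is a minimal sketch comparing how the same single component value reads in an additive space (DeviceGray) and in a subtractive one (Separation). The class and method names are made up for illustration, and treating the Separation lightness as simply 1 - tint is a simplifying assumption: the actual appearance is determined by the tint transform of the particular color space.

```java
class TintConvention {
    static float clamp(float v) {
        return Math.max(0f, Math.min(1f, v));
    }

    // DeviceGray is additive: 0.0 is black, 1.0 is white.
    static float lightnessDeviceGray(float gray) {
        return clamp(gray);
    }

    // Separation is subtractive: 0.0 means no colorant (lightest),
    // 1.0 means full colorant (darkest).
    // Assumption: lightness falls off linearly with the tint.
    static float lightnessSeparation(float tint) {
        return 1f - clamp(tint);
    }

    public static void main(String[] args) {
        System.out.println("DeviceGray 0.0 -> lightness " + lightnessDeviceGray(0f));
        System.out.println("Separation 0.0 -> lightness " + lightnessSeparation(0f));
    }
}
```

So a component array of [0.0] reads as black in DeviceGray but as "no ink at all" in a Separation space.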

Inside view

Your content stream starts like this:

/Cs8 cs 1 scn

Cs8 is a Separation color space:

/Cs8 [/Separation /Black [/ICCBased 17 0 R] 18 0 R] 

with an ICCBased alternate space which in turn has DeviceRGB as alternate space

17 0 obj
<<
/Length 2597
/Alternate /DeviceRGB
/Filter /FlateDecode
/N 3
>>
stream
[...ICC profile...]
endstream
endobj 

and a tint transform by samples to the alternate color space

18 0 obj
<<
/Length 779
/BitsPerSample 8
/Decode [0 1 0 1 0 1]
/Domain [0 1]
/Encode [0 254]
/Filter /FlateDecode
/FunctionType 0
/Range [0 1 0 1 0 1]
/Size [255]
>>
stream
[...255 samples from (255,255,255) to (35,31,32)...]
endstream
endobj 
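As a rough illustration of how such a sampled (type 0) function turns a tint into an RGB triple, the following sketch applies the Domain, Encode, and Decode entries from the dictionary above and interpolates linearly between neighboring samples. The real sample data is not reproduced here, so the table is an assumption: a straight linear ramp from (255,255,255) to (35,31,32). The class and method names are hypothetical.

```java
class SampledTintTransform {
    static final int SIZE = 255;                    // /Size [255]
    static final int[] LIGHTEST = {255, 255, 255};  // first sample
    static final int[] DARKEST = {35, 31, 32};      // last sample

    // Assumption: the 255 samples form a linear ramp between the two
    // end points; the actual stream content may differ in between.
    static int[][] buildSamples() {
        int[][] samples = new int[SIZE][3];
        for (int i = 0; i < SIZE; i++) {
            double t = i / (double) (SIZE - 1);
            for (int c = 0; c < 3; c++) {
                samples[i][c] = (int) Math.round(LIGHTEST[c] + t * (DARKEST[c] - LIGHTEST[c]));
            }
        }
        return samples;
    }

    // Maps a tint in [0, 1] to RGB components in [0, 1].
    static double[] evaluate(int[][] samples, double tint) {
        double x = Math.min(1.0, Math.max(0.0, tint));            // clip to /Domain [0 1]
        double e = x * 254.0;                                     // /Encode [0 254]
        e = Math.min(SIZE - 1, Math.max(0.0, e));                 // clip to the sample table
        int lo = (int) Math.floor(e);
        int hi = Math.min(lo + 1, SIZE - 1);
        double frac = e - lo;
        double[] rgb = new double[3];
        for (int c = 0; c < 3; c++) {
            double sample = samples[lo][c] + frac * (samples[hi][c] - samples[lo][c]);
            rgb[c] = sample / 255.0;                              // 8 bits, /Decode [0 1]
        }
        return rgb;
    }

    public static void main(String[] args) {
        int[][] samples = buildSamples();
        System.out.println(java.util.Arrays.toString(evaluate(samples, 0.0)));
        System.out.println(java.util.Arrays.toString(evaluate(samples, 1.0)));
    }
}
```

Under these assumptions a tint of 0.0 lands on the first sample (white) and a tint of 1.0 on the last one (the dark gray).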

Your content stream continues with operations drawing the headers and the start of the first row and then

/TT2 1 Tf
0 scn
13.559 0 TD
6.8438 Tc
<00140014>Tj
1 scn 

0 scn sets the color to the lightest tint of the Cs8 Black separation, which the samples map to (255,255,255), i.e. plain white on screen. 6.8438 Tc sets a large character spacing, which explains the gap between the two '1's. <00140014>Tj draws the two '1's. Finally, 1 scn switches back to the darkest tint, which the samples map to (35,31,32), a very dark gray.
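The operator sequence above can be traced with a deliberately naive sketch that only tracks the current non-stroking tint and records which tint is active at each Tj. This is not how PDFBox parses content streams; the tokenizer, the class name, and the single-component handling of scn are all simplifying assumptions for this one snippet.

```java
import java.util.ArrayList;
import java.util.List;

class TintTrace {
    // Returns one entry per Tj: the shown string operand and the active tint.
    static List<String> trace(String content) {
        List<String> shown = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        float tint = 1f; // assumption: the stream started with "1 scn"
        // Naive tokenizing: put a space after '>' so "<...>Tj" splits in two.
        String spaced = content.replace(">", "> ");
        for (String token : spaced.trim().split("\\s+")) {
            switch (token) {
                case "scn":
                    // Single-component color space assumed: last operand is the tint.
                    tint = Float.parseFloat(operands.get(operands.size() - 1));
                    operands.clear();
                    break;
                case "Tj":
                    shown.add(operands.get(operands.size() - 1) + " @ tint " + tint);
                    operands.clear();
                    break;
                case "cs":
                case "Tf":
                case "TD":
                case "Tc":
                    operands.clear(); // irrelevant for the color, just consume operands
                    break;
                default:
                    operands.add(token);
            }
        }
        return shown;
    }

    public static void main(String[] args) {
        String snippet = "/TT2 1 Tf 0 scn 13.559 0 TD 6.8438 Tc <00140014>Tj 1 scn";
        System.out.println(trace(snippet)); // prints [<00140014> @ tint 0.0]
    }
}
```

The two '1's (character codes 0014 and 0014) are the only text in the snippet, and they are shown while the tint is 0.0.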

With PDFBox

In a comment you say

when I debug it in processTextPosition(TextPosition text), gs.getNonStrokingColor() has same value for those "1" tokens as for others tokens and is actually black

To recognize this with PDFBox, you have to tell its PDFTextStripper to look for the generic color space selection and color selection operators cs and scn and extend processTextPosition like in this proof-of-concept:

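import java.util.Arrays;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Package names as in PDFBox 2.0.x
PDFTextStripper stripper = new PDFTextStripper() {
    // Components of the most recently reported non-stroking color
    float[] components;

    @Override
    protected void processTextPosition(TextPosition text) {
        PDGraphicsState gs = getGraphicsState();
        PDColor color = gs.getNonStrokingColor();
        float[] currentComponents = color.getComponents();
        // Print the color components only when they change
        if (!Arrays.equals(components, currentComponents)) {
            System.out.print(Arrays.toString(currentComponents));
            components = currentComponents;
        }
        System.out.print(text.getUnicode());
        super.processTextPosition(text);
    }
};

// Register the cs and scn operators so the graphics state tracks the color
stripper.addOperator(new SetNonStrokingColorSpace());
stripper.addOperator(new SetNonStrokingColorN());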

(ExtractText test testTestSeparation)

With these settings in place you get

[1.0]TenantLeaseStart ... 3,500.00RENT[0.0]11[1.0]16,133.33

As you see the color component starts with 1.0, for the two '1's it is 0.0, and thereafter it becomes 1.0 again until the next run of invisible '1's.

mkl
  • However, when I debug it in processTextPosition(TextPosition text) , gs.getNonStrokingColor() has same value for those "1" tokens as for others tokens and is actually black... Is this some other property I can check to handle it? thanks in advance looks like there is nobody else knowing this PDFBox lib... – D.F. Stones Apr 26 '18 at 18:01
  • @DmitryK I added a proof-of-concept to my answer to show how to see those color changes. – mkl Apr 27 '18 at 08:50
  • well, but does this mean that [0.0] always means text is invisible? Looks like no, because in others documents when float[] currentComponents = color.getComponents(); is [0.0] character is visible on screen. So this does not exactly tell text is invisible... – D.F. Stones Apr 29 '18 at 13:43
  • The meaning of those numbers of course has to be determined in relation to the color space. In the code above I merely showed that your claim that the colors of the '1's and of the other text both were reported as black by PDFBox, is wrong if one registers the appropriate operators. – mkl Apr 29 '18 at 15:18
  • oh got it, so in SetNonStrokingColorSpace I should get this color space definition and then use it in processTextPosition? – D.F. Stones Apr 29 '18 at 16:41
  • `SetNonStrokingColorSpace` will cause the non stroking colour space entry in the current `PDGraphicsState` instance to be properly set. You should query that entry. Ah, of course you will also have to register all those colour setting operations which implicitly set a colour space... – mkl Apr 29 '18 at 19:29
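
Following up on the last comments: since the raw components only mean something relative to the color space, one option is to compare the resolved RGB value instead. The sketch below is a hypothetical helper, not part of PDFBox. It takes a packed 0xRRGGBB value and flags it as near white by relative luminance. In PDFBox 2.x such a value can, as far as I know, be obtained from `PDColor.toRGB()`, which resolves the components through the color space; treat that as an assumption to verify against your version. The threshold is arbitrary, and the check assumes a white page background; it says nothing about clipping, text rendering mode, or text covered by other content.

```java
class NearWhiteCheck {
    // Returns true if the packed 0xRRGGBB color is at least as bright as
    // the given relative luminance threshold (0 = black, 1 = white).
    static boolean isNearWhite(int packedRgb, double threshold) {
        int r = (packedRgb >> 16) & 0xFF;
        int g = (packedRgb >> 8) & 0xFF;
        int b = packedRgb & 0xFF;
        // Rec. 709 luma weights applied directly to the 8-bit values
        double luminance = (0.2126 * r + 0.7152 * g + 0.0722 * b) / 255.0;
        return luminance >= threshold;
    }

    public static void main(String[] args) {
        // The two end points of the tint transform discussed in the answer
        System.out.println(isNearWhite(0xFFFFFF, 0.95)); // tint 0.0 -> (255,255,255)
        System.out.println(isNearWhite(0x231F20, 0.95)); // tint 1.0 -> (35,31,32)
    }
}
```

With this, the '1's drawn at tint 0.0 would be flagged while the regular text at tint 1.0 would not, independent of which color space produced the values.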