
I am reading text from a PDF using the PDFBox library and saving it to a text file. It also reads hidden text that is not visible when the PDF is viewed in a PDF reader. I need some characteristics of this hidden text that distinguish it from normal text.

jkj
  • For some ideas have a look at the [questions of D.F. Stones](https://stackoverflow.com/users/9123040/d-f-stones?tab=questions) and the respective answers; here a number of options have been visited. – mkl Sep 17 '20 at 17:30
  • Thanks @mkl for sharing your views. I came across few solutions shared by you and tried applying it and it did work to remove some of the hidden text of PDF. But still there is some hidden text in the PDF which is rotated at particular angle and I need to remove that text as well from the output. Could you please share some of your insights to resolve this issue? – jkj Sep 23 '20 at 07:28
  • FYI.. I used the solution mentioned here. https://stackoverflow.com/questions/47908124/pdfbox-removing-invisible-text-by-clip-filling-paths-issue – jkj Sep 23 '20 at 07:29
  • *"But still there is some hidden text in the PDF which is rotated at particular angle and I need to remove that text as well from the output. Could you please share some of your insights to resolve this issue?"* - Please share the PDF in question for analysis. – mkl Sep 23 '20 at 15:26
  • @mkl https://drive.google.com/file/d/1jFhF9y8jh_tr9POU258Fvn9WDRGdv-4-/view?usp=drivesdk – jkj Sep 24 '20 at 06:20
  • The "DRAFT - UNAUDITED" in that file is drawn in white (CMYK 0 0 0 0 to be exact) very early in the page drawing process. Thus, it is *not invisible* as in *covered by something* or *outside the current clip path*, it is white text on a white background in plain sight. – mkl Sep 24 '20 at 13:58
  • Ok fine. I have another PDF which has same text in silver color. How to identify and remove text here? Could you please check and guide me? https://drive.google.com/file/d/1bEcpJheSWTl29B1SGheSv34k9S6VIMeb/view?usp=drivesdk – jkj Sep 25 '20 at 13:04
  • The *silver* actually is a specific gray with value 0.753 in a Gray Gamma 2.2 XYZ **ICCBased** colorspace. – mkl Sep 25 '20 at 14:31
  • By the way: *"How to identify and remove text here?"* - By *remove text* do you mean actually removing it from the PDF (to prevent it from appearing in copy&paste from the viewer) or merely from the text you extract? – mkl Sep 25 '20 at 17:23
  • I meant from text that is extracted – jkj Sep 25 '20 at 19:54
  • @mkl so how to remove this silver invisible text? – jkj Sep 26 '20 at 14:19

1 Answer


One possible criterion for the texts to ignore in your example files is the text color, pure CMYK white in one case, 0.753 in a Gray Gamma 2.2 XYZ ICCBased colorspace in the other case.

So let's extend the text stripper with a color filtering option. In particular, this means adding operator processors for the color-setting instructions, because the `PDFTextStripper` ignores them by default and the graphics state would otherwise not reflect the current color:

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Written against PDFBox 2.x, where the operator processors have no-argument constructors.
public class PDFFilteringTextStripper extends PDFTextStripper {
    /** Decides per glyph whether it is kept in the extracted text. */
    public interface TextStripperFilter {
        boolean accept(TextPosition text, PDGraphicsState graphicsState);
    }

    final TextStripperFilter filter;

    public PDFFilteringTextStripper(TextStripperFilter filter) throws IOException {
        // Register the color operators so the graphics state tracks the current colors.
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColorSpace());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColorN());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceGrayColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceGrayColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceRGBColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceRGBColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceCMYKColor());
        addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceCMYKColor());

        this.filter = filter;
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // Forward only the glyphs the filter accepts; everything else is dropped from the output.
        PDGraphicsState graphicsState = getGraphicsState();
        if (filter.accept(text, graphicsState))
            super.processTextPosition(text);
    }
}
```

(PDFFilteringTextStripper class)

Using that text stripper class, we can filter the white text from the first example PDF like this:

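```java
import java.util.Arrays;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceCMYK;

// Pure white in DeviceCMYK: no cyan, magenta, yellow, or black.
float[] colorToFilter = new float[] {0,0,0,0};

PDDocument document = ...;
PDFFilteringTextStripper stripper = new PDFFilteringTextStripper((text, gs) -> {
    PDColor color = gs.getNonStrokingColor();
    // Keep the glyph unless its fill color is exactly CMYK white.
    return color == null || !((color.getColorSpace() instanceof PDDeviceCMYK) && Arrays.equals(color.getComponents(), colorToFilter));
});
String text = stripper.getText(document);
```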

(ExtractFilteredText test testExtractNoWhiteText...)

Similarly we can filter the gray text from the second example PDF like this:

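```java
import org.apache.pdfbox.pdmodel.graphics.color.PDICCBased;

// The single gray component used for the text in the second example PDF.
float[] colorToFilter = new float[] {0.753f};

PDDocument document = ...;
PDFFilteringTextStripper stripper = new PDFFilteringTextStripper((text, gs) -> {
    PDColor color = gs.getNonStrokingColor();
    // Keep the glyph unless its fill color is exactly this gray in an ICCBased colorspace.
    return color == null || !((color.getColorSpace() instanceof PDICCBased) && Arrays.equals(color.getComponents(), colorToFilter));
});
String text = stripper.getText(document);
```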

(ExtractFilteredText test testExtractNoGrayText...)
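
Both filters compare the color components with `Arrays.equals`, i.e. for exact float equality. That works for the two example files, but a value written as, say, 0.7529 in another PDF would slip through. A minimal sketch of a tolerance-based comparison follows; the class `ColorComponents`, the method `matches`, and the tolerance value are assumptions for illustration and not part of PDFBox:

```java
/** Hypothetical helper (not a PDFBox class): compares color components with a tolerance. */
final class ColorComponents {
    private ColorComponents() {
    }

    /**
     * Returns true if both arrays have the same length and every component
     * of actual differs from the matching component of expected by at most tolerance.
     */
    static boolean matches(float[] actual, float[] expected, float tolerance) {
        if (actual == null || expected == null || actual.length != expected.length) {
            return false;
        }
        for (int i = 0; i < actual.length; i++) {
            if (Math.abs(actual[i] - expected[i]) > tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

In the filter lambdas, `Arrays.equals(color.getComponents(), colorToFilter)` could then be swapped for something like `ColorComponents.matches(color.getComponents(), colorToFilter, 0.001f)`. A wider tolerance also risks dropping legitimately visible text of a similar color, so the value should be chosen per document set.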


In a comment you asked:

> A quick question- this text in 0.753 in a Gray Gamma 2.2 XYZ ICCBased colorspace - invisible text? Or is it just because of the colorspace, text is not visible in PDF?

It is visible! (Thus, strictly speaking you should not remove it from the extracted text.)

It merely is quite small. On the title page zoom in on the year "2016":

[Image: "2016" with small "DRAFT - UNAUDITED"]

mkl
  • This worked, thanks :) A quick question- this text in 0.753 in a Gray Gamma 2.2 XYZ ICCBased colorspace - invisible text? Or is it just because of the colorspace, text is not visible in PDF? – jkj Sep 28 '20 at 09:54
  • I'll have to verify. I'll look later. – mkl Sep 28 '20 at 11:22
  • Ok sure. Thanks again :) – jkj Sep 28 '20 at 12:49
  • See the edit to my answer, it is not invisible at all, merely very small. – mkl Sep 28 '20 at 15:34
  • Ohh yes, it's so strange. Could you please check this PDF? It has some text in the footer, "Green bonds - Made by KfW....". This text is searchable and its color is dim gray, but it is not visible. What could be the reason? https://drive.google.com/file/d/1epCmrJ1lsM9o5X_m3xgVVqhmyYY6Lf1O/view?usp=drivesdk – jkj Sep 29 '20 at 10:50
  • In the content stream that footer text up to and including the page number is drawn first, then a large white rectangle covering it, making that footer text invisible, and only then the text you see is drawn. – mkl Sep 30 '20 at 09:07
  • OK, so how can I ignore this text during extraction? It is not visible in the PDF, so it does not need to be extracted. – jkj Sep 30 '20 at 09:14
  • Use the [`PDFVisibleTextStripper`](https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/main/java/mkl/testarea/pdfbox2/extract/PDFVisibleTextStripper.java) originally from [this answer](https://stackoverflow.com/a/47396555/1729265). You can adapt the `PDFFilteringTextStripper` here to extend the `PDFVisibleTextStripper` instead of the `PDFTextStripper`. – mkl Sep 30 '20 at 09:40
  • Ok will try that. Meanwhile i have a quick question - if some text is truly invisible in PDF then is it true that we will always get a negative getYDirAdj() value for that text? – jkj Sep 30 '20 at 10:18
  • What do you mean by "*truly* invisible"? – mkl Sep 30 '20 at 12:57
  • lol.. I mean the text which is not of white color on white background or something with too small font size and thus is invisible. – jkj Oct 01 '20 at 05:51
  • Well, there are still numerous ways for text to be not visible, and only a few of them imply a negative `getYDirAdj()` value (namely those that have the text at the top, outside the page borders). The others rely on details like a text rendering mode drawing nothing (mode 3 - *invisible* - or 7 - *clip-path*), a blend mode resulting in no change (e.g. in mode *Lighten* or *Difference* drawing anything black makes no difference.), clip paths the text is drawn outside of, font files with only blank glyphs, etc etc etc – mkl Oct 01 '20 at 09:30
  • Ohh ok got it. Thanks a lot for sharing this info. Is there any more reference material or links online where I can read more about this stuff "The others rely on details like a text rendering mode drawing nothing (mode 3 - invisible - or 7 - clip-path), a blend mode resulting in no change (e.g. in mode Lighten or Difference drawing anything black makes no difference.), clip paths the text is drawn outside of, font files with only blank glyphs, etc etc etc" – jkj Oct 01 '20 at 09:51
  • I use the PDF specification ISO 32000. A copy of the first version (from 2008) has been published by Adobe for free download (merely the ISO headers are missing), simply google for "PDF32000_2008". The second version (from 2017) is not officially available for free. – mkl Oct 01 '20 at 10:00
  • Ok I will refer it. Thanks for sharing and thanks for all the help so far :) – jkj Oct 01 '20 at 13:40
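
Two of the criteria mentioned in the comments above, a text rendering mode that draws nothing (3 or 7) and text positioned outside the page borders, are simple enough to check directly. The following is only a sketch under assumptions: the class and method names are hypothetical, the rendering mode is passed as the plain integer from the PDF specification, and the vertical check assumes a top-down coordinate like the one `getYDirAdj()` reports, with values above the page height treated as below the page. It does not cover the other cases listed (blend modes, clip paths, covering shapes, blank glyphs).

```java
/** Hypothetical helper (not a PDFBox class): cheap hints that text may not be visible. */
final class InvisibleTextHints {
    private InvisibleTextHints() {
    }

    /**
     * Text rendering modes 3 (neither fill nor stroke) and 7 (add to clip path only)
     * paint no glyphs on the page.
     */
    static boolean paintsNothing(int renderingMode) {
        return renderingMode == 3 || renderingMode == 7;
    }

    /**
     * True if a top-down y coordinate lies above the top edge (negative)
     * or below the bottom edge (greater than the page height).
     */
    static boolean outsidePageVertically(float yFromTop, float pageHeight) {
        return yFromTop < 0 || yFromTop > pageHeight;
    }
}
```

With PDFBox 2.x, the rendering mode should be obtainable inside a filter via `gs.getTextState().getRenderingMode().intValue()`, and the page height from the current page's media or crop box; treat both as starting points to verify against the PDFBox version in use.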