
I'm parsing a PDF using PDFBox and I'm trying to get the text color. I can get other properties like font, size, and position no problem using TextPosition attributes. Here's how I'm doing it:

@Override
protected void writeString(String string, List<TextPosition> textPositions) {
    for (TextPosition textPosition : textPositions) {
        System.out.println(textPosition.getFont());
        System.out.println(textPosition.getFontSizeInPt());
        System.out.println(textPosition.getXDirAdj() + ", " + textPosition.getYDirAdj());
    }
}

However, I'm unable to retrieve the color of the text. I've searched for a solution, but nothing has worked so far. Every tutorial I find seems to use an older version of PDFBox, and several of the methods they call don't exist in my version. For example, one SO question recommended this code:

@Override
protected void processTextPosition(TextPosition text) {

    try {
        PDGraphicsState graphicsState = getGraphicsState();
        System.out.println("R = " + graphicsState.getNonStrokingColor().getJavaColor().getRed());
        System.out.println("G = " + graphicsState.getNonStrokingColor().getJavaColor().getGreen());
        System.out.println("B = " + graphicsState.getNonStrokingColor().getJavaColor().getBlue());
    }

    catch (IOException ioe) {}

}

When I try to use this, IntelliJ tells me "getJavaColor()" is undefined. I have also tried with this code:

@Override
protected void processTextPosition(TextPosition text) {

    try {
        PDGraphicsState graphicsState = getGraphicsState();
        System.out.println("R = " + graphicsState.getNonStrokingColor().toRGB());
    }
    catch (IOException ioe) {System.out.println(ioe); }

}

The method is called as expected, and the expected number of times, but it always prints 0, even though my PDF file contains both black text and red text.

Here are my Maven dependencies:

<dependencies>

    <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.17</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/fontbox -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>fontbox</artifactId>
        <version>2.0.17</version>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox-tools -->
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox-tools</artifactId>
        <version>2.0.17</version>
    </dependency>

</dependencies>

Any help is appreciated.

David Antelo
  • It is not; that's where I got the non-working code from. That question is old, and a few of the methods it uses no longer exist. – David Antelo Nov 25 '19 at 12:27
  • Could you please share your file? Maybe the red text is a vector graphic, or a type 3 font where the color is within the charprocs. The best code is `PrintTextColors.java` from the source code download. – Tilman Hausherr Nov 25 '19 at 13:15
  • It could also be a very special color type for which PDFBox does not know how to compute RGB values. Or the text could be drawn as a clip path, with the color only arriving during a later fill of that area. Or the text could be drawn in black but have its color changed by the blend mode in effect at that time, or by a funny blend mode application later. Or... or... or... Because of such possibilities, I wouldn't start looking into such issues unless the exact document is shared for investigation. – mkl Nov 25 '19 at 13:41
  • Thanks for the ideas guys, just found the solution, I'll post it as the answer – David Antelo Nov 25 '19 at 14:21

1 Answer


Apparently, in PDFBox 2.0.0 and later, you need to add these lines of code:

addOperator(new SetStrokingColorSpace());
addOperator(new SetNonStrokingColorSpace());
addOperator(new SetStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceRGBColor());
addOperator(new SetStrokingDeviceRGBColor());
addOperator(new SetNonStrokingDeviceGrayColor());
addOperator(new SetStrokingDeviceGrayColor());
addOperator(new SetStrokingColor());
addOperator(new SetStrokingColorN());
addOperator(new SetNonStrokingColor());
addOperator(new SetNonStrokingColorN());

to the constructor of your class that extends PDFTextStripper. Now if you use:

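@Override
protected void processTextPosition(TextPosition textPosition) {
    try {
        PDGraphicsState graphicsState = getGraphicsState();
        // toRGB() returns the color packed into a single int (0xRRGGBB)
        System.out.println(graphicsState.getNonStrokingColor().toRGB());
    } catch (Exception e) {
        // Report the failure instead of silently swallowing it
        System.out.println(e);
    }
}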

it actually prints a real value.
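The number printed above is one packed integer, not three separate channels. As a rough sketch, assuming the value follows the 0xRRGGBB layout that `PDColor.toRGB()` uses in PDFBox 2.0.x, it can be split into components with plain bit operations. The class name `PackedRgb` and its methods are made up for illustration and are not part of PDFBox. Note that with this layout, black text legitimately yields 0, while pure red yields 16711680.

```java
// Illustrative helper (hypothetical name, not a PDFBox class).
// Assumption: the input int is packed as 0xRRGGBB, as returned by PDColor.toRGB().
final class PackedRgb {
    private PackedRgb() {}

    // Bits 16-23 hold the red channel
    static int red(int rgb) {
        return (rgb >> 16) & 0xFF;
    }

    // Bits 8-15 hold the green channel
    static int green(int rgb) {
        return (rgb >> 8) & 0xFF;
    }

    // Bits 0-7 hold the blue channel
    static int blue(int rgb) {
        return rgb & 0xFF;
    }

    // Human-readable form, similar to the R/G/B printing attempted in the question
    static String describe(int rgb) {
        return String.format("R=%d G=%d B=%d", red(rgb), green(rgb), blue(rgb));
    }

    public static void main(String[] args) {
        System.out.println(describe(0));        // black: R=0 G=0 B=0
        System.out.println(describe(16711680)); // pure red: R=255 G=0 B=0
    }
}
```

Inside `processTextPosition`, the same helper could be applied to the result of `getNonStrokingColor().toRGB()` to print the channels separately.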

David Antelo
  • *"Apparently in PDFBox 2.0.0+ versions you need to add these lines of code"* - You needed that, too, in the pre-2 versions. For example see [this old answer](https://stackoverflow.com/a/21453780/1729265). As you appeared to have looked at the old answers, I thought you knew and that couldn't be the issue. – mkl Nov 25 '19 at 14:29
  • I didn't go that far; I only looked at the top-rated ones. – David Antelo Nov 25 '19 at 14:33
  • Although it indeed prints a real value, the value is not associated with the text's color and therefore is of no use :P. It printed the same color value for every line when processing my pdf, and this shouldn't have been the case. – Michael Sinclair Apr 07 '21 at 17:44