Get colours from fonts in PDFBox

Question

I am trying to get the font colour from PDFBox and I seem to keep throwing an exception. Can someone help? The way I tried to obtain the colour was (page is the PDPage I obtained):

PDResources resources = page.getResources();
Iterable<COSName> fontNames = resources.getFontNames();
for (COSName fontName : fontNames)
   System.out.println("name: " + resources.getFont(fontName).getName() + 
                      ", colour: " + resources.getColorSpace(fontName).getName());

This prints out the exception:

org.apache.pdfbox.pdmodel.MissingResourceException: Missing color space: F1

Could someone tell me how to properly get the colour of a font obtained in this manner?

Fonts don't have a color at all. They can be painted with a stroking or a non stroking color or both and even more. To see what I mean, look at the PDF files here with Adobe Reader (not with firefox): https://issues.apache.org/jira/browse/PDFBOX-678 . And you can even have text that is cut out from an image, or a shading so one single glyph could have several colors. Do you know in advance that your PDF files won't use any of the "interesting" modes? — Tilman Hausherr, Jul 14 '16 at 14:42
@TilmanHausherr I see what you mean, yes I am sure that there would be no such edge cases in my pdf files. Would it be possible to get any colour information from this in this case? — kabeersvohra, Jul 14 '16 at 14:44
I'm not sure what a stroking colour is. Would this be what I require? Even if there was a weird case like this and the algorithm output just one of the font colours that appeared, that would be sufficient for my use case — kabeersvohra, Jul 14 '16 at 14:46
Stroking color is for lines, a non stroking color is for fills. If you don't know what a "stroke color" is, then you can't be sure that there aren't these special cases. I've seen seemingly "boring" files that did have them. Re color, see here: https://stackoverflow.com/questions/21430341/identifying-the-text-based-on-the-output-in-pdf-using-pdfbox/21453780#21453780 read also the migration guide https://pdfbox.apache.org/2.0/migration.html , the part "In 1.8, to get the text colors". And yes it is still tricky. — Tilman Hausherr, Jul 14 '16 at 14:56
@TilmanHausherr I tried to use the method shown above and it doesn't work because the registerOperatorProcessor method has been deprecated and the org.apache.pdfbox.util.operator package has been removed in v2.0. Is there any other way that works with the latest version of PDFBox? — kabeersvohra, Jul 15 '16 at 08:28
That's why I pointed you to the migration page. You need to use `addOperator()`. Oh. I just see that there is an example program, and I even coauthored it. LOL. I knew I had done something, but thought it was on the mailing list. See upcoming answer. — Tilman Hausherr, Jul 15 '16 at 09:03
Haha sorry for all the questions, thanks for the answer. I will implement it and then let you know :) — kabeersvohra, Jul 15 '16 at 09:12
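The stroking versus non-stroking distinction discussed in the comments above depends on the text rendering mode. As a rough sketch, assuming the mode numbers 0 to 7 from section 9.3.6 "Text Rendering Mode" of the PDF 32000 specification, the plain-Java helper below shows which of the two colours paints a glyph in each mode. The class and method names are hypothetical and are not part of PDFBox.

```java
/**
 * Illustrative only: maps PDF text rendering modes (0 to 7, PDF 32000 section 9.3.6)
 * to the colours that paint a glyph. Hypothetical helper, not a PDFBox class.
 */
final class TextPaintModes
{
    private TextPaintModes()
    {
    }

    /** Modes 0, 2, 4 and 6 fill the glyph with the non-stroking colour. */
    static boolean usesFill(int mode)
    {
        checkMode(mode);
        return mode == 0 || mode == 2 || mode == 4 || mode == 6;
    }

    /** Modes 1, 2, 5 and 6 outline the glyph with the stroking colour. */
    static boolean usesStroke(int mode)
    {
        checkMode(mode);
        return mode == 1 || mode == 2 || mode == 5 || mode == 6;
    }

    /** Human-readable summary of which colour(s) are visible for a mode. */
    static String describe(int mode)
    {
        boolean fill = usesFill(mode);
        boolean stroke = usesStroke(mode);
        if (fill && stroke)
        {
            return "non-stroking and stroking colour";
        }
        if (fill)
        {
            return "non-stroking colour";
        }
        if (stroke)
        {
            return "stroking colour";
        }
        // Modes 3 (invisible) and 7 (clip only) paint nothing.
        return "neither colour (invisible or clip only)";
    }

    private static void checkMode(int mode)
    {
        if (mode < 0 || mode > 7)
        {
            throw new IllegalArgumentException("Rendering mode must be 0..7, got " + mode);
        }
    }

    public static void main(String[] args)
    {
        for (int mode = 0; mode <= 7; mode++)
        {
            System.out.println(mode + ": " + describe(mode));
        }
    }
}
```

For ordinary text in the default mode 0, only the non-stroking colour matters, which is usually what people mean by the "font colour".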

score 1 · Accepted Answer · answered Jul 15 '16 at 09:04

Try PrintTextColors from the source code download:

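import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColor;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceCMYKColor;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceGrayColor;
import org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceRGBColor;
import org.apache.pdfbox.contentstream.operator.color.SetStrokingColor;
import org.apache.pdfbox.contentstream.operator.color.SetStrokingColorN;
import org.apache.pdfbox.contentstream.operator.color.SetStrokingColorSpace;
import org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceCMYKColor;
import org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceGrayColor;
import org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceRGBColor;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

/**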
 * This is an example on how to get the colors of text. Note that this will not tell the background,
 * and will only work properly if the text is not overwritten later, and only if the text rendering
 * modes are 0, 1 or 2. In the PDF 32000 specification, please read 9.3.6 "Text Rendering Mode" to
 * know more. Mode 0 (FILL) is the default. Mode 1 (STROKE) will make glyphs look "hollow". Mode 2
 * (FILL_STROKE) will make glyphs look "fat".
 *
 * @author Ben Litchfield
 * @author Tilman Hausherr
 */
public class PrintTextColors extends PDFTextStripper
{
    /**
     * Instantiate a new PDFTextStripper object.
     *
     * @throws IOException If there is an error loading the properties.
     */
    public PrintTextColors() throws IOException
    {
        addOperator(new SetStrokingColorSpace());
        addOperator(new SetNonStrokingColorSpace());
        addOperator(new SetStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetStrokingDeviceGrayColor());
        addOperator(new SetStrokingColor());
        addOperator(new SetStrokingColorN());
        addOperator(new SetNonStrokingColor());
        addOperator(new SetNonStrokingColorN());
    }

    /**
     * This will print the documents data.
     *
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            try
            {
                document = PDDocument.load(new File(args[0]));

                PDFTextStripper stripper = new PrintTextColors();
                stripper.setSortByPosition(true);
                stripper.setStartPage(1); // PDFTextStripper page numbers are 1-based
                stripper.setEndPage(document.getNumberOfPages());

                Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
                stripper.writeText(document, dummy);
            }
            finally
            {
                if (document != null)
                {
                    document.close();
                }
            }
        }
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        super.processTextPosition(text);

        PDColor strokingColor = getGraphicsState().getStrokingColor();
        PDColor nonStrokingColor = getGraphicsState().getNonStrokingColor();
        String unicode = text.getUnicode();
        RenderingMode renderingMode = getGraphicsState().getTextState().getRenderingMode();
        System.out.println("Unicode:            " + unicode);
        System.out.println("Rendering mode:     " + renderingMode);
        System.out.println("Stroking color:     " + strokingColor);
        System.out.println("Non-Stroking color: " + nonStrokingColor);
        System.out.println();

        // See the PrintTextLocations for more attributes
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
        System.err.println("Usage: java " + PrintTextColors.class.getName() + " <input-pdf>");
    }
}

Ok I am implementing it, I set the start page to 0 and the end page to the number of pages which returned an IOException (the document is one page). I then set the start page to 1 and the end page to 1 as well and the getText function as well as the writeText function returned an empty string — kabeersvohra, Jul 15 '16 at 09:59
Maybe you don't have any text to extract in that PDF. This can happen to prevent text extraction. Try the code on another PDF. — Tilman Hausherr, Jul 15 '16 at 10:03
yeah I have just done testing with multiple different PDF documents that all have text on them, would you like me to send them to you? — kabeersvohra, Jul 15 '16 at 10:24
I just tried with this file: https://www.einfach-fuer-alle.de/artikel/checkliste-barrierefreie-pdf/Checkliste-Barrierefreies-PDF.pdf you can upload your file to a sharehoster. — Tilman Hausherr, Jul 15 '16 at 10:27
Sorry, it seems that there is another problem with my code not related to this since I took this snippet out into a new project to test it and it works. Thank you for all your help — kabeersvohra, Jul 15 '16 at 10:44
So the error ended up being an indexing error... I was passing the index I had obtained earlier, which was 0, into the start page function, giving no output. Feel so dumb haha — kabeersvohra, Jul 15 '16 at 12:37
I have found that processTextPosition only works for new letters not repeated letters. Is there any way of making it process for each letter? An example is when getText returns Dgdxgdd, processTextPosition only returns information about D, g, d and x — kabeersvohra, Jul 19 '16 at 08:48
I can't comment on that without the PDF. Sometimes the stuff in getText is not in the same sequence as in processTextPosition. — Tilman Hausherr, Jul 19 '16 at 11:54
Do you know how to extend this answer to include font attributes as well (bold, underline, italics). I use this PDFstripper to extract the texts, the fonts and the colours now and it works well. Is there an operator I can add or a way to extract the attribute information from the textState object? Thanks — kabeersvohra, Jul 27 '16 at 10:40
@KVohra95 there is no such thing as "underline attribute". Lines are just lines. Re bold / italics - this can sometimes be deduced from the font name, but not always - the best would be to make a list of all possible bold fonts and also use some heuristics. And it gets worse: in theory (does not happen often), italics can also be done with a skew matrix, and bold can also be simulated by using a stroke + fill font rendering mode. — Tilman Hausherr, Jul 27 '16 at 10:48
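The font-name heuristic mentioned in the comment above could be sketched roughly as follows. This is an assumption-laden illustration in plain Java: the class name, the method names and the keyword list are made up here, and, as the comment warns, a name-based guess will miss fonts whose names carry no style hint as well as bold or italic effects produced by rendering mode or a skew matrix.

```java
import java.util.Locale;

/**
 * Illustrative only: guesses bold/italic from a PDF font name.
 * Hypothetical helper, not a PDFBox class; the keyword list is a guess.
 */
final class FontNameHints
{
    private FontNameHints()
    {
    }

    /** Removes a subset tag such as "ABCDEF+" (six uppercase letters and a plus sign). */
    static String stripSubsetPrefix(String fontName)
    {
        return fontName.replaceFirst("^[A-Z]{6}\\+", "");
    }

    /** True if the name (without subset tag) contains "bold", ignoring case. */
    static boolean looksBold(String fontName)
    {
        String name = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);
        return name.contains("bold");
    }

    /** True if the name (without subset tag) contains "italic" or "oblique", ignoring case. */
    static boolean looksItalic(String fontName)
    {
        String name = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);
        return name.contains("italic") || name.contains("oblique");
    }

    public static void main(String[] args)
    {
        String[] samples = { "ABCDEF+Arial-BoldMT", "Helvetica-Oblique", "TimesNewRomanPSMT" };
        for (String sample : samples)
        {
            System.out.println(sample + " bold=" + looksBold(sample)
                    + " italic=" + looksItalic(sample));
        }
    }
}
```

In a PrintTextColors-style stripper, the input would presumably be the name of the font attached to each TextPosition; extending the keyword list (for example with "black" or "heavy") is a judgment call per document set.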
This solution doesn't work as the color component printed is always 0.0 and therefore undefined. — Michael Sinclair, Apr 07 '21 at 20:09
@MichaelSinclair `This solution doesn't work as the color component printed is always 0.0 and therefore undefined` please create a new question on that one; I see you did create a question, an answer, and then deleted it, so this is very confusing. Alternatively create an issue in the PDFBox JIRA. — Tilman Hausherr, Apr 08 '21 at 06:46
@TilmanHausherr thank you for embarrassing me. JK haha, I just posted a more refined question and answer which should be able to more simply answer this question as well. Thanks for the advice! https://stackoverflow.com/questions/67026428/pdfbox-how-to-load-color-from-text — Michael Sinclair, Apr 09 '21 at 18:24
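On the 0.0 component mentioned in the last comments: a component of 0.0 is a defined value, and in a gray or RGB device colour space it simply means black. The sketch below is a naive, plain-Java illustration of turning raw components into a packed RGB value. It assumes the component count identifies the device colour space (1 = gray, 3 = RGB, 4 = CMYK), which is a simplification; other colour spaces such as ICC-based, indexed or separation cannot be interpreted this way, and the CMYK formula is not colour-managed. The class and method names are hypothetical.

```java
/**
 * Illustrative only: naive conversion of device colour components to packed 0xRRGGBB.
 * Hypothetical helper, not a PDFBox class; assumes 1 = gray, 3 = RGB, 4 = CMYK.
 */
final class DeviceColorSketch
{
    private DeviceColorSketch()
    {
    }

    static int toRgb(float[] components)
    {
        switch (components.length)
        {
            case 1:
                // DeviceGray: 0.0 is black, 1.0 is white.
                return pack(components[0], components[0], components[0]);
            case 3:
                // DeviceRGB: components are red, green, blue.
                return pack(components[0], components[1], components[2]);
            case 4:
                // DeviceCMYK, naive formula without colour management.
                float k = components[3];
                return pack((1 - components[0]) * (1 - k),
                            (1 - components[1]) * (1 - k),
                            (1 - components[2]) * (1 - k));
            default:
                throw new IllegalArgumentException(
                        "Unsupported component count: " + components.length);
        }
    }

    private static int pack(float r, float g, float b)
    {
        return (scale(r) << 16) | (scale(g) << 8) | scale(b);
    }

    /** Clamps to 0..1 and scales to 0..255. */
    private static int scale(float value)
    {
        float clamped = Math.max(0f, Math.min(1f, value));
        return Math.round(clamped * 255f);
    }

    public static void main(String[] args)
    {
        System.out.println(String.format("%06X", toRgb(new float[] { 0f })));
        System.out.println(String.format("%06X", toRgb(new float[] { 1f, 0f, 0f })));
    }
}
```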
