Using PDFBox to get location of line of text

Question

I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line. I can't find anything related to how to get that information though. I know pdfbox has a class called TextPosition, but I can't find out how to get a TextPosition object from the PDDocument either. How do I get the location information of a line of text from a pdf?

There are multiple samples showing how to get `TextPosition` objects from a document, e.g. [this answer](http://stackoverflow.com/a/20924898/1729265) in the section *The general procedure and a PDFBox issue*. The *issue* meanwhile has been resolved. — mkl, Oct 06 '15 at 21:09
@mkl How is writeString called? It's protected, so it's likely called from within another method in TextStripper, but I'm not sure which one. I tried the solution in the next answer, about the charactersByArticle, but the vector I got as a result was empty. — Beez, Oct 07 '15 at 03:11
*How is writeString called* - you apply a `PDFTextStripper` instance to your document and that instance calls `writeString` again and again. — mkl, Oct 07 '15 at 08:10
*I tried the solution in the next answer, about the charactersByArticle* - that only makes sense for pdfs which contain certain additional meta information separating multiple articles in the document. If your PDF does not have such information, `charactersByArticle` won't help. — mkl, Oct 07 '15 at 09:28
Sorry, I'm brand new to looking at pdfs, and I feel like you're referencing things you think I should know but I don't. You say apply a PDFTextStripper instance to my document and that will do it, but how do I do that? I've tried calling startDocument and getText, and neither of those ran the code in the new writeString method. — Beez, Oct 07 '15 at 15:28
Ok, I'll try and write something up later. Currently I'm on a smart phone which is not optimal for providing in-depth samples. — mkl, Oct 07 '15 at 15:56

mkl · Accepted Answer · 2015-10-09T10:33:47.543

In general

To extract text (with or without extra information like positions, colors, etc.) using PDFBox, you instantiate a PDFTextStripper or a class derived from it and use it like this:

PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);

(PDFTextStripper has a number of attributes, e.g. the start and end page set via setStartPage and setEndPage, that let you restrict the pages text is extracted from.)

While getText executes, the content streams of the pages in question (and those of form XObjects referenced from those pages) are parsed and the text drawing commands in them are processed.

If you want to change the text extraction behavior, you have to change how these text drawing commands are processed, which is most often done by overriding this method:

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

If you additionally need to know when a new line starts, you may also want to override

/**
 * Write the line separator value to the output stream.
 * @throws IOException
 *             If there is a problem writing out the lineseparator to the document.
 */
protected void writeLineSeparator( ) throws IOException
{
    output.write(getLineSeparator());
}

writeString can be overridden to channel the text information into separate members (e.g. if you want the result in a more structured format than a mere String), or it can be overridden to simply add some extra information to the result String.

writeLineSeparator can be overridden to trigger some specific output between lines.

There are more methods which can be overridden but you are less likely to need them in general.
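To illustrate how these hooks get called, here is a deliberately simplified stand-in. It is not PDFBox code: `MiniStripper` and `extractWithLineNumbers` are hypothetical names made up for this sketch, and the real PDFTextStripper does far more work. The point is only the call pattern: you call the public getText, and getText in turn calls the protected methods, so an override takes effect without you ever invoking it yourself.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in, NOT PDFBox code: shows how getText drives the overridable hooks. */
class CallbackFlowSketch {

    /** Hypothetical miniature of a text stripper built on the template-method pattern. */
    static class MiniStripper {
        protected final StringBuilder output = new StringBuilder();

        /** Public entry point: walks the lines and invokes the protected hooks. */
        public String getText(List<String> lines) {
            output.setLength(0);
            for (String line : lines) {
                writeString(line);        // hook 1: called once per piece of text
                writeLineSeparator();     // hook 2: called at the end of each line
            }
            return output.toString();
        }

        protected void writeString(String text) {
            output.append(text);
        }

        protected void writeLineSeparator() {
            output.append('\n');
        }
    }

    /** Overrides one hook to prepend a running number to every line. */
    static String extractWithLineNumbers(List<String> lines) {
        MiniStripper stripper = new MiniStripper() {
            private int lineNumber = 1;

            @Override
            protected void writeString(String text) {
                super.writeString(lineNumber++ + ": " + text);
            }
        };
        // Only getText is called from outside; it triggers the override above.
        return stripper.getText(lines);
    }

    public static void main(String[] args) {
        System.out.print(extractWithLineNumbers(Arrays.asList("first line", "second line")));
    }
}
```

The same pattern explains a common pitfall: if a subclass replaces getText itself instead of the hooks, the loop that calls the hooks never runs and the overrides are silently ignored.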

In the case at hand

> I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line.

This can be implemented as follows (simply adding the information at the start of each line):

PDFTextStripper stripper = new PDFTextStripper()
{
    @Override
    protected void startPage(PDPage page) throws IOException
    {
        startOfLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        startOfLine = true;
        super.writeLineSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        if (startOfLine)
        {
            TextPosition firstPosition = textPositions.get(0);
            writeString(String.format("[%s]", firstPosition.getXDirAdj()));
            startOfLine = false;
        }
        super.writeString(text, textPositions);
    }
    boolean startOfLine = true;
};

text = stripper.getText(document);

(ExtractText.java method extractLineStart tested by testExtractLineStartFromSampleFile)
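If you need the x-positions as numbers again later, the bracketed prefix written by the code above can be parsed back out of the result String. The following is a minimal sketch using only the JDK; `LineStartParser` and `LineStart` are hypothetical names, and it assumes every relevant line starts with the `[x]` prefix exactly as formatted above. Lines without such a prefix (for example empty lines) are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits "[x]text" lines, as produced above, into x and text. */
class LineStartParser {

    /** One extracted line: the x of its first character and the line's text. */
    static final class LineStart {
        final float x;
        final String text;

        LineStart(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    // A leading "[", a run of characters other than "]", a "]", then the rest of the line.
    private static final Pattern PREFIXED_LINE = Pattern.compile("^\\[([^\\]]+)\\](.*)$");

    static List<LineStart> parse(String extractedText) {
        List<LineStart> result = new ArrayList<>();
        for (String line : extractedText.split("\\R")) {
            Matcher matcher = PREFIXED_LINE.matcher(line);
            if (!matcher.matches()) {
                continue; // no prefix, e.g. an empty line
            }
            try {
                float x = Float.parseFloat(matcher.group(1));
                result.add(new LineStart(x, matcher.group(2)));
            } catch (NumberFormatException e) {
                // The brackets did not contain a number; treat the line as unprefixed.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "[72.0]First line\n[90.5]Indented line\n";
        for (LineStart lineStart : parse(sample)) {
            System.out.println(lineStart.x + " -> " + lineStart.text);
        }
    }
}
```

Parsing the text back is only a workaround. If the positions are the main goal, collecting them into a list inside the overridden writeString avoids the round trip through a String.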

This answer helped me tremendously. I also found my problem before with calling getText, I had put in a getText function myself before I knew to extend PDFTextStripper, and that was keeping it from calling the new writeString function. Thanks! — Beez, Oct 09 '15 at 03:39
@Beez can you share your code, i am also stuck with this type of problem. I want to change the color blue to black from text (which starts with 'http' OR 'https'). — Asad Rao, Oct 29 '19 at 10:37
