
Is there a way to preserve the text formatting when extracting a PDF with PDFBox?

I have a program that parses a PDF document for information. When a new version of the PDF is released, the authors use bold or italic text to indicate new information and strikethrough or underlined text to indicate omitted text. The base PDFTextStripper class in PDFBox returns all the text, but the formatting is removed, so I have no way of telling whether the text is new or omitted. I'm currently using the project example code below:

    Dim doc As PDDocument = Nothing

    Try
        doc = PDDocument.load(RFPFilePath)
        Dim stripper As New PDFTextStripper()

        stripper.setAddMoreFormatting(True)
        stripper.setSortByPosition(True)
        rtxt_DocumentViewer.Text = stripper.getText(doc)

    Finally
        If doc IsNot Nothing Then
            doc.close()
        End If
    End Try

My parsing code works well if I simply copy and paste the PDF text into a RichTextBox, which preserves the formatting. I was thinking of doing this programmatically by opening the PDF, selecting all, copying, closing the document, and then pasting into my RichTextBox, but that seems clunky.

Neelix
  • *"the authors use bold or italic text to indicate new information and Strike through or underlined to indicated omitted text"* - do they use different fonts for that? Or do they use poor man's bold etc. emulations? – mkl Oct 10 '16 at 17:31
  • I believe these started as msword documents then were converted to PDF. If you were to copy/paste the text into a word document the font remains the same with the Bold/Italics or Strikethrough attribute enabled. – Neelix Oct 10 '16 at 21:06
  • That does not answer my question. If you do not know, please share documents to demonstrate. – mkl Oct 11 '16 at 21:20
  • Thank you for the help and I guess I don't understand your question. The document is nothing special, just a Word doc converted to PDF. I created an example document with the same formatting I'm encountering here: http://www.filedropper.com/exampledocument – Neelix Oct 12 '16 at 02:04
  • The bold and italic effects in your sample document are generated by using a different font (containing bold or italic versions of the letters) to draw the text. The underline and strike-through effects in your sample document are generated by drawing a rectangle under / through the text line which has the width of the text line and a very small height. To extract this information, therefore, you have to extend the `PDFTextStripper` to somehow react to font changes and rectangles near text. – mkl Oct 13 '16 at 15:58
  • I'm only using PDFBox with Java and, therefore can only provide example code in Java. If that would be ok, I also need to know the PDFBox version you use. In particular, is it a 1.8.x or a 2.0.x? – mkl Oct 13 '16 at 15:59
  • Yes Java would be fine thank you. I have version 1.8.9 but I'm not set on a particular version. – Neelix Oct 13 '16 at 20:04
  • Is the code in my answer helping you? Or are there any issues? – mkl Oct 17 '16 at 10:29
  • I believe I understand the structure of your code but I'm still trying to figure out how to implement it in .NET. Can you provide your imports from pdfbox? – Neelix Oct 17 '16 at 23:51
  • If you follow the link right under the code in the answer, you'll find the whole java source file of the respective class. – mkl Oct 18 '16 at 04:23
  • have you succeeded porting the code to .Net? – mkl Nov 15 '16 at 05:15
  • No, I attempted to convert it over to VB.NET but it was a mess. I planned to start from scratch and follow your logic but the project fell on the back burner and I haven't had a chance to look into it further. – Neelix Nov 17 '16 at 14:23

1 Answer


As the OP mentioned in a comment that a Java example would do, and as I have so far only used PDFBox with Java, this answer features a Java example. Furthermore, this example has been developed and tested with PDFBox version 1.8.11 only.

A customized text stripper

As already mentioned in a comment,

The bold and italic effects in the OP's sample document are generated by using a different font (containing bold or italic versions of the letters) to draw the text. The underline and strike-through effects in the sample document are generated by drawing a rectangle under / through the text line which has the width of the text line and a very small height. To extract this information, therefore, one has to extend the PDFTextStripper to somehow react to font changes and rectangles near text.

This is an example class extending the PDFTextStripper just like that:

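// Imports for PDFBox 1.8.x (package names differ in 2.0.x)
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
import org.apache.pdfbox.util.operator.OperatorProcessor;

public class PDFStyledTextStripper extends PDFTextStripper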
{
    public PDFStyledTextStripper() throws IOException
    {
        super();
        registerOperatorProcessor("re", new AppendRectangleToPath());
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        for (TextPosition textPosition : textPositions)
        {
            Set<String> style = determineStyle(textPosition);
            if (!style.equals(currentStyle))
            {
                output.write(style.toString());
                currentStyle = style;
            }
            output.write(textPosition.getCharacter());
        }
    }

    Set<String> determineStyle(TextPosition textPosition)
    {
        Set<String> result = new HashSet<>();

        if (textPosition.getFont().getBaseFont().toLowerCase().contains("bold"))
            result.add("Bold");

        if (textPosition.getFont().getBaseFont().toLowerCase().contains("italic"))
            result.add("Italic");

        if (rectangles.stream().anyMatch(r -> r.underlines(textPosition)))
            result.add("Underline");

        if (rectangles.stream().anyMatch(r -> r.strikesThrough(textPosition)))
            result.add("StrikeThrough");

        return result;
    }

    class AppendRectangleToPath extends OperatorProcessor
    {
        public void process(PDFOperator operator, List<COSBase> arguments)
        {
            COSNumber x = (COSNumber) arguments.get(0);
            COSNumber y = (COSNumber) arguments.get(1);
            COSNumber w = (COSNumber) arguments.get(2);
            COSNumber h = (COSNumber) arguments.get(3);

            double x1 = x.doubleValue();
            double y1 = y.doubleValue();

            // create a pair of coordinates for the transformation
            double x2 = w.doubleValue() + x1;
            double y2 = h.doubleValue() + y1;

            Point2D p0 = transformedPoint(x1, y1);
            Point2D p1 = transformedPoint(x2, y1);
            Point2D p2 = transformedPoint(x2, y2);
            Point2D p3 = transformedPoint(x1, y2);

            rectangles.add(new TransformedRectangle(p0, p1, p2, p3));
        }

        Point2D.Double transformedPoint(double x, double y)
        {
            double[] position = {x,y}; 
            getGraphicsState().getCurrentTransformationMatrix().createAffineTransform().transform(
                    position, 0, position, 0, 1);
            return new Point2D.Double(position[0],position[1]);
        }
    }

    static class TransformedRectangle
    {
        public TransformedRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3)
        {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }

        boolean strikesThrough(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular strikeThroughs with p0 at the left bottom and p2 at the right top

            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to strike through
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff < 0 || vertDiff > textPosition.getFont().getFontDescriptor().getAscent() * textPosition.getFontSizeInPt() / 1000.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        boolean underlines(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular underlines with p0 at the left bottom and p2 at the right top

            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to underline
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff > 0 || vertDiff < textPosition.getFont().getFontDescriptor().getDescent() * textPosition.getFontSizeInPt() / 500.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        final Point2D p0, p1, p2, p3;
    }

    final List<TransformedRectangle> rectangles = new ArrayList<>();
    Set<String> currentStyle = Collections.singleton("Undefined");
}

(PDFStyledTextStripper.java)

In addition to what the PDFTextStripper does, this class also

  • collects rectangles from the content (defined using the re instruction) using an instance of the AppendRectangleToPath operator processor inner class,
  • checks text for the style variants from the sample document in determineStyle, and
  • whenever the style changes, adds the new style to the result in writeString.

Beware: This merely is a proof of concept! In particular

  • the implementations of the tests in TransformedRectangle.underlines(TextPosition) and TransformedRectangle.strikesThrough(TextPosition) are very simplistic and only work for horizontal text without page rotation and for horizontal rectangular strike-throughs and underlines with p0 at the bottom left and p2 at the top right;
  • all rectangles are collected, not checking whether they actually are filled with a visible color;
  • the tests for "bold" and "italic" merely inspect the name of the used font which may not suffice in general.
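To make the limits of those rectangle tests easier to see, here is a minimal stand-alone sketch of the same idea on plain numbers, without any PDFBox types. The class name LineDecorationSketch, the enum, and all parameter names are made up for this illustration. It assumes horizontal text, unrotated pages, and an ascent and descent that are already scaled to page units (glyph-space value times font size divided by 1000). Treating a rectangle that starts exactly on the baseline as an underline is also an assumption of this sketch.

```java
/** Stand-alone sketch of the underline / strike-through tests; no PDFBox types involved. */
class LineDecorationSketch {
    enum Decoration { NONE, UNDERLINE, STRIKE_THROUGH }

    /**
     * Classifies a rectangle relative to one piece of horizontal text.
     * All parameters are hypothetical and given in page units; ascent is positive
     * and descent is negative, both already scaled by fontSize / 1000.
     */
    static Decoration classify(double textX, double baselineY, double textWidth, double fontSize,
                               double ascent, double descent,
                               double rectLeft, double rectBottom,
                               double rectRight, double rectTop) {
        // A rectangle 2 units high or more is treated as a box, not as a line.
        if (Math.abs(rectTop - rectBottom) >= 2) {
            return Decoration.NONE;
        }
        // The rectangle must span the text horizontally (tolerance of fontSize / 10 at the right end).
        if (rectLeft > textX || rectRight < textX + textWidth - fontSize / 10.0) {
            return Decoration.NONE;
        }
        double vertDiff = rectBottom - baselineY;
        // From the baseline down to twice the descent: underline (baseline itself counts, by assumption).
        if (vertDiff <= 0 && vertDiff >= 2 * descent) {
            return Decoration.UNDERLINE;
        }
        // Above the baseline, up to the ascent: strike-through.
        if (vertDiff > 0 && vertDiff <= ascent) {
            return Decoration.STRIKE_THROUGH;
        }
        return Decoration.NONE;
    }

    public static void main(String[] args) {
        // Text starting at x=100 on baseline y=700, 50 units wide, 12pt font.
        System.out.println(classify(100, 700, 50, 12, 9.0, -2.5, 100, 698.5, 150, 699.2));
        System.out.println(classify(100, 700, 50, 12, 9.0, -2.5, 100, 703.0, 150, 703.7));
        System.out.println(classify(100, 700, 50, 12, 9.0, -2.5, 100, 690.0, 150, 720.0));
    }
}
```

A thin rectangle just below the baseline is reported as an underline, one inside the ascent range as a strike-through, and a tall box as neither. Rotated text or rectangles given with other corner orders would need additional handling, as the caveats above say.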

A test output

Using the PDFStyledTextStripper like this

String extractStyled(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFStyledTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}

(from ExtractText.java, called from the test method testExtractStyledFromExampleDocument)

one gets the result

[]This is an example of plain text 
 
[Bold]This is an example of bold text 
[] 
[Underline]This is an example of underlined text[] 
 
[Italic]This is an example of italic text  
[] 
[StrikeThrough]This is an example of strike through text[]  
 
[Italic, Bold]This is an example of bold, italic text 

for the OP's sample document

Screenshot
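For further processing, the bracketed markers in such output can be turned back into styled runs. The following is only a sketch: the class StyledRunParser and its Run type are hypothetical names, not part of PDFBox or of the stripper above. It assumes the markers look exactly like the Set.toString() output shown (for example [Italic, Bold] or []), and it cannot distinguish a marker from identical bracketed text that occurs in the document itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits the tagged stripper output into (styles, text) runs. */
class StyledRunParser {
    /** One stretch of text together with the style names active for it. */
    static final class Run {
        final Set<String> styles;
        final String text;

        Run(Set<String> styles, String text) {
            this.styles = styles;
            this.text = text;
        }
    }

    // Matches "[]" or a comma-separated list of the four style names the stripper writes.
    private static final Pattern MARKER = Pattern.compile(
            "\\[((?:Bold|Italic|Underline|StrikeThrough)"
            + "(?:, (?:Bold|Italic|Underline|StrikeThrough))*)?\\]");

    static List<Run> parse(String tagged) {
        List<Run> runs = new ArrayList<>();
        Set<String> current = new TreeSet<>();
        int pos = 0;
        Matcher m = MARKER.matcher(tagged);
        while (m.find()) {
            // Text before this marker still belongs to the previous style set.
            addRun(runs, current, tagged.substring(pos, m.start()));
            current = new TreeSet<>();
            if (m.group(1) != null) {
                current.addAll(Arrays.asList(m.group(1).split(", ")));
            }
            pos = m.end();
        }
        addRun(runs, current, tagged.substring(pos));
        return runs;
    }

    // Simplification of this sketch: whitespace-only runs are dropped and text is trimmed.
    private static void addRun(List<Run> runs, Set<String> styles, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            runs.add(new Run(styles, trimmed));
        }
    }
}
```

With runs like these, a caller could, for instance, treat runs containing Bold or Italic as new text and runs containing Underline or StrikeThrough as omitted text, which is the distinction the question asks for.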


PS The code of the PDFStyledTextStripper meanwhile has been slightly changed to also work for a sample document shared in a github issue, in particular the code of its inner class TransformedRectangle, cf. here.

mkl
  • Thanks for the answer. That helped a lot. – hrzafer Feb 17 '17 at 21:54
  • Hi, is there a recommended way of detecting the text alignment information as well? I am planning to use the text position X coordinate values for that but I wanted to ask before I start. I especially need to detect center aligned text. – hrzafer May 02 '17 at 15:59
  • @hrzafer *is there a recommended way of detecting the text alignment information as well?* - I have not experimented with that yet. – mkl May 02 '17 at 23:49
  • regarding the text alignment, page margins are important. To be able to calculate margins, I make an assumption that the horizontal coordinate is a double value between 0-614 on a letter size page. Do you think that kind of assumption makes sense? – hrzafer May 10 '17 at 16:09
  • By the way, I am looking for a measurement which is independent of the resolution and will work on any machine including a server. Does the value returned from textPosition.getX() provide this? – hrzafer May 10 '17 at 16:17
  • *"I make an assumption that the horizontal coordinate is a double value between 0-614 on a letter size page. Do you think that kind of assumption makes sense?"* - That assumption may hold often but not always. Each PDF page defines its own range of coordinates for the visible area by its **MediaBox** and **CropBox** (the latter one defaulting to the former's value). It is merely due to the laziness of PDF generators that the lower left corner of the page usually is the origin. And the width may also vary depending on the **UserUnit** setting of the page... – mkl May 10 '17 at 21:04
  • I simply use the height and width values coming from the CropBox and divide the x values by the width and the y values by the height, which returns a value in [0-1]. This value tells how close a position is to the edges. For example, this value is usually 0.95+ for footer text. So far I use this value to filter out the header/footer text and it works for most of the cases. – hrzafer May 11 '17 at 21:49
  • *"it works for most of the cases"* - well, most cases have the origin in or near the lower left corner, for them that should work. Some have not, for them that doesn't work. If you want code to work in all cases, also use the offset of the lower left corner from the origin. – mkl May 12 '17 at 04:07
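
The offset correction from the last comment can be sketched on plain numbers as follows. PagePositionSketch and its parameter names are hypothetical. The sketch assumes the coordinate is in the same coordinate system as the CropBox values, which may or may not be true for the values a particular PDFBox version hands out, and it ignores page rotation and UserUnit.

```java
/** Hypothetical helper: position of a coordinate as a fraction of the visible page box. */
class PagePositionSketch {
    /**
     * @param coordinate   x (or y) value of the text, in page units
     * @param boxLowerLeft lower left x (or y) of the CropBox
     * @param boxExtent    width (or height) of the CropBox
     * @return 0 at the left (bottom) edge of the box, 1 at the right (top) edge
     */
    static double normalize(double coordinate, double boxLowerLeft, double boxExtent) {
        if (boxExtent <= 0) {
            throw new IllegalArgumentException("box extent must be positive");
        }
        // Subtract the offset of the box from the origin before dividing by its size.
        return (coordinate - boxLowerLeft) / boxExtent;
    }

    public static void main(String[] args) {
        // CropBox starting at x=36 and 612 units wide: x=342 is the horizontal center.
        System.out.println(normalize(342, 36, 612));
        // Dividing by the width alone would place the same x to the right of the center.
        System.out.println(342 / 612.0);
    }
}
```

For pages whose CropBox starts at the origin both calculations agree, which is why dividing by the width alone works for most documents.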