I am new to PDFBox 2.0.2 (https://github.com/apache/pdfbox/tree/2.0.2) and would like to get all the stroked lines (for instance, the column and row borders of a table) of a page (PDPage), so I created the following class:

package org.apache.pdfbox.rendering;

import java.awt.geom.GeneralPath;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URI;

import org.apache.commons.io.IOUtils;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.rendering.PageDrawer;
import org.apache.pdfbox.rendering.PageDrawerParameters;

public class LineCatcher {
    private PageDrawer pageDrawer;
    private PDDocument document;
    private PDFRenderer pdfRenderer;
    private PDPage page;

    public LineCatcher(URI pdfSrcURI) throws IllegalArgumentException, 
        MalformedURLException, IOException {
        this.document = PDDocument.load(IOUtils.toByteArray(pdfSrcURI));
        this.pdfRenderer = new PDFRenderer(this.document);
    }
    public GeneralPath getLinePath(int pageIndex) throws IOException {
        this.page = this.document.getPage(pageIndex);
        PageDrawerParameters parameters = new PageDrawerParameters (this.pdfRenderer, this.page);
        this.pageDrawer = new PageDrawer(parameters);
        this.pageDrawer.processPage(this.page); //catches exception here
        return this.pageDrawer.getLinePath();
    }
}

As I understand it, a page has to be processed before its line path can be retrieved, so I call processPage on the line marked "catches exception here". That call unexpectedly throws a NullPointerException. The stack trace is:

java.lang.NullPointerException
  at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:599)
  at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroRule.process(FillNonZeroRule.java:36)
  at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:815)
  at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:472)
  at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:446)
  at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
  at org.apache.pdfbox.rendering.LineCatcher.getLinePath(LineCatcher.java:33)
  at org.apache.pdfbox.rendering.TestLineCatcher.testGetLinePath(TestLineCatcher.java:21)

Can anyone give some advice on my approach or help debug the code? Thanks in advance.

Rui
  • 1
    It's definitely wrong... getLinePath() is to get the current line path while processing the page. It is reset to empty after each fill/stroke. It is NOT what you think, i.e. a path with all the lines of a page. I'll see if I can come up with something better, e.g. catch the stroke operator. – Tilman Hausherr Aug 13 '16 at 12:33

1 Answer

Extending PageDrawer didn't really work, so I extended PDFGraphicsStreamEngine and here's the result. I do some of the stuff that is done in PageDrawer. To collect lines, either evaluate the shape in strokePath(), or collect points and lines in the other methods where I have included a println.

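import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class LineCatcher extends PDFGraphicsStreamEngine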
{
    private final GeneralPath linePath = new GeneralPath();
    private int clipWindingRule = -1;

    public LineCatcher(PDPage page)
    {
        super(page);
    }

    public static void main(String[] args) throws IOException
    {
        try (PDDocument document = PDDocument.load(new File("Test.pdf")))
        {
            PDPage page = document.getPage(0);
            LineCatcher test = new LineCatcher(page);
            test.processPage(page);
        }
    }

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        System.out.println("appendRectangle");
        // to ensure that the path is created in the right direction, we have to create
        // it by combining single lines instead of creating a simple rectangle
        linePath.moveTo((float) p0.getX(), (float) p0.getY());
        linePath.lineTo((float) p1.getX(), (float) p1.getY());
        linePath.lineTo((float) p2.getX(), (float) p2.getY());
        linePath.lineTo((float) p3.getX(), (float) p3.getY());

        // close the subpath instead of adding the last line so that a possible set line
        // cap style isn't taken into account at the "beginning" of the rectangle
        linePath.closePath();
    }

    @Override
    public void drawImage(PDImage pdi) throws IOException
    {
    }

    @Override
    public void clip(int windingRule) throws IOException
    {
        // the clipping path will not be updated until the succeeding painting operator is called
        clipWindingRule = windingRule;

    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        linePath.moveTo(x, y);
        System.out.println("moveTo");
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        linePath.lineTo(x, y);
        System.out.println("lineTo");
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        linePath.curveTo(x1, y1, x2, y2, x3, y3);
        System.out.println("curveTo");
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        return linePath.getCurrentPoint();
    }

    @Override
    public void closePath() throws IOException
    {
        linePath.closePath();
    }

    @Override
    public void endPath() throws IOException
    {
        if (clipWindingRule != -1)
        {
            linePath.setWindingRule(clipWindingRule);
            getGraphicsState().intersectClippingPath(linePath);
            clipWindingRule = -1;
        }
        linePath.reset();

    }

    @Override
    public void strokePath() throws IOException
    {
        // the path is complete at this point: evaluate or store it here, before it is reset
        System.out.println(linePath.getBounds2D());

        linePath.reset();
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
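        // this operator strokes as well as fills; evaluate linePath here too
        // if filled-and-stroked shapes matter to you
        linePath.reset();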
    }

    @Override
    public void shadingFill(COSName cosn) throws IOException
    {
    }
}
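
One way to "evaluate the shape in strokePath()" is to walk the path with java.awt.geom.PathIterator and turn each straight segment into a Line2D. What follows is only a minimal sketch under that assumption, not code from PDFBox: the class name PathSegments and the method toLines are made up for illustration, curves are skipped, and the rectangle coordinates in main are arbitrary sample values.

```java
import java.awt.Shape;
import java.awt.geom.GeneralPath;
import java.awt.geom.Line2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): extracts the straight segments of a path.
class PathSegments {
    static List<Line2D> toLines(Shape path) {
        List<Line2D> lines = new ArrayList<>();
        double[] c = new double[6];
        double startX = 0, startY = 0; // first point of the current subpath
        double curX = 0, curY = 0;     // current point
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(c)) {
                case PathIterator.SEG_MOVETO:
                    startX = curX = c[0];
                    startY = curY = c[1];
                    break;
                case PathIterator.SEG_LINETO:
                    lines.add(new Line2D.Double(curX, curY, c[0], c[1]));
                    curX = c[0];
                    curY = c[1];
                    break;
                case PathIterator.SEG_QUADTO:
                    // curves are ignored, only the current point is advanced
                    curX = c[2];
                    curY = c[3];
                    break;
                case PathIterator.SEG_CUBICTO:
                    curX = c[4];
                    curY = c[5];
                    break;
                case PathIterator.SEG_CLOSE:
                    // closing a subpath adds the implicit edge back to its start
                    if (curX != startX || curY != startY) {
                        lines.add(new Line2D.Double(curX, curY, startX, startY));
                    }
                    curX = startX;
                    curY = startY;
                    break;
                default:
                    break;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // same construction as appendRectangle() above, with sample coordinates
        GeneralPath rect = new GeneralPath();
        rect.moveTo(100f, 400f);
        rect.lineTo(150f, 400f);
        rect.lineTo(150f, 450f);
        rect.lineTo(100f, 450f);
        rect.closePath();
        for (Line2D line : toLines(rect)) {
            System.out.println(line.getP1() + " -> " + line.getP2());
        }
    }
}
```

In the class above, such a helper could be called on linePath inside strokePath() before the path is reset. The resulting lines are in the same coordinates the path was built with.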

Update 19.3.2019: See also follow-up answer by mkl here.

Tilman Hausherr
  • Thanks a lot for your great help :) May I ask what is the reason for the linePath to be `final` ? – Rui Aug 14 '16 at 11:55
  • It's a recommendation by Netbeans, because it is never overwritten. It is not really important. – Tilman Hausherr Aug 14 '16 at 11:58
  • Aha, so it is similar to a `List`? As I remember, a list should usually be declared final. And what is the `windingRule`? It does not seem to matter much, but it would be good to know something about it. Moreover, there should be a method returning the `linePath`, right? – Rui Aug 14 '16 at 12:01
  • re windingrule, please look into the PDF 32000 specification (google for that). It is probably not relevant for you because you're only interested in lines, not in filled shapes. Re returning the linepath, you could change the code to replace linePath with a List and then add the linePath in the stroke method. My answer is meant to explain how to get the raw stuff, you need to adjust it to what you are trying to do with tables. If you want I can change my answer to what I just mentioned. Just say so and I'll do it. – Tilman Hausherr Aug 14 '16 at 13:20
  • Wonderful, great help <3 This made my learning process much faster :) There may still be problems when I write my custom code, but so far so good! Bingo! – Rui Aug 14 '16 at 17:31
  • I am still not clear about when closePath is called and when endPath is called instead. – Rui Aug 15 '16 at 07:48
  • closepath is the "h" operator. "Close the current subpath by appending a straight line segment from the current point to the starting point of the subpath." endpath is the "n" operator, which ends the current path without filling or stroking it, but updates the clipping path, so it is not really relevant to you unless tables are really modified by clipping. To see more what happens in documents, use PDFDebugger and look at the "contents" part. In the PDF 32000 documentation, have a look at the segment "operator summary", maybe print the 3 pages and tape them to your wall :-) – Tilman Hausherr Aug 15 '16 at 08:32
  • 1
    In methods - strokePath(), fillPath(), and endPath(), the method GeneralPath.reset() is called. Why they are call just in these 3 methods? I still can not get the logic for it. Moreover, if I would like to add GeneralPath to a list, I have to replace the GeneralPath.reset call with a new GeneralPath call, right? – Rui Aug 15 '16 at 12:26
  • reset() is called there because with these operators the path is finished: it is either stroked, filled, or not drawn at all (but added to the clipping path). Yes, if you want a list of paths, you'd need to create a new empty path instead of calling reset() (if you don't, every list entry would be the same object). – Tilman Hausherr Aug 15 '16 at 13:05
  • I have another question here :) http://stackoverflow.com/questions/38962072/pdfbox-2-0-2-how-to-combine-the-textposition-coordinates-and-graphics-generalp – Rui Aug 15 '16 at 19:55
  • I've used this code and it works fine with a PDF containing 1 rectangle. What about a PDF with 2 or more rectangles or other shapes? – BartmanDilaw Jul 07 '20 at 08:45
  • It should work as well. You need to customize this code to your needs. – Tilman Hausherr Jul 07 '20 at 11:25
  • @TilmanHausherr: I mean I've tried this code with a PDF containing 2 rectangles, but only one of the two shows up in the strokePath function... I have only just started with Java, which makes this overridden code too tricky for me. Where (which function) and over what (which structure/class) should I loop to get all the rectangles? Thanks! – BartmanDilaw Jul 07 '20 at 11:43
  • Maybe one of them is an image? Or both rectangles are painted in a single stroke (but appendRectangle is called twice)? Please share that PDF. Better create a new question, include the code that you used. – Tilman Hausherr Jul 07 '20 at 12:13
  • @TilmanHausherr The PDF was simply created by a Java program that adds 2 rectangles to the content stream (cf. https://stackoverflow.com/questions/62702167/java-how-to-read-pdrectangle-coordinates-from-pdf-document)... That page has 2 pieces of Java code: one to create a PDF with one rectangle, the second to read it... I just added "contentStream.addRect(100, 400, 50, 50);" to the first Java program to create a second rectangle... You can get the code there... Thanks again. – BartmanDilaw Jul 07 '20 at 12:51
  • That means you added 2 rectangles to the path, but stroke only once (which is fine). Your path contains both rectangles. – Tilman Hausherr Jul 07 '20 at 13:23
  • @TilmanHausherr OK, but how do I show the second one's position? How do I loop over the rectangles in the path? – BartmanDilaw Jul 07 '20 at 13:35
  • OK @TilmanHausherr... Got it. used the wrong PDF file. Sorry :-) – BartmanDilaw Jul 07 '20 at 14:14
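
Two points from the comments above can be sketched with plain java.awt.geom, without PDFBox: storing each stroked path in a list by starting a new GeneralPath instead of calling reset(), and looping over the subpaths of a single path that holds several rectangles. This is only an illustration under stated assumptions: StrokeCollector, getStrokedPaths and subpathBounds are hypothetical names, the class stands in for the path bookkeeping of the answer's engine, and a subpath is taken to be everything from one moveTo up to the next.

```java
import java.awt.Shape;
import java.awt.geom.GeneralPath;
import java.awt.geom.PathIterator;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the path handling of the answer's engine (no PDFBox involved).
class StrokeCollector {
    private GeneralPath linePath = new GeneralPath(); // not final: it is replaced after each stroke
    private final List<GeneralPath> strokedPaths = new ArrayList<>();

    void appendRectangle(float x, float y, float w, float h) {
        linePath.moveTo(x, y);
        linePath.lineTo(x + w, y);
        linePath.lineTo(x + w, y + h);
        linePath.lineTo(x, y + h);
        linePath.closePath();
    }

    void strokePath() {
        strokedPaths.add(linePath);
        // a fresh object; calling reset() here would also empty the path just stored
        linePath = new GeneralPath();
    }

    List<GeneralPath> getStrokedPaths() {
        return strokedPaths;
    }

    // Returns the bounds of each subpath, where a subpath starts at every moveTo.
    static List<Rectangle2D> subpathBounds(Shape path) {
        List<Rectangle2D> bounds = new ArrayList<>();
        GeneralPath sub = null;
        float[] c = new float[6];
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            int type = it.currentSegment(c);
            if (type == PathIterator.SEG_MOVETO) {
                if (sub != null) {
                    bounds.add(sub.getBounds2D());
                }
                sub = new GeneralPath();
                sub.moveTo(c[0], c[1]);
            } else if (sub != null) {
                switch (type) {
                    case PathIterator.SEG_LINETO:
                        sub.lineTo(c[0], c[1]);
                        break;
                    case PathIterator.SEG_QUADTO:
                        sub.quadTo(c[0], c[1], c[2], c[3]);
                        break;
                    case PathIterator.SEG_CUBICTO:
                        sub.curveTo(c[0], c[1], c[2], c[3], c[4], c[5]);
                        break;
                    case PathIterator.SEG_CLOSE:
                        sub.closePath();
                        break;
                    default:
                        break;
                }
            }
        }
        if (sub != null) {
            bounds.add(sub.getBounds2D());
        }
        return bounds;
    }

    public static void main(String[] args) {
        StrokeCollector collector = new StrokeCollector();
        collector.appendRectangle(200f, 250f, 75f, 50f); // made-up first rectangle
        collector.appendRectangle(100f, 400f, 50f, 50f); // values from the addRect call quoted above
        collector.strokePath();                          // one stroke, one path, two subpaths
        for (Rectangle2D r : subpathBounds(collector.getStrokedPaths().get(0))) {
            System.out.println(r);
        }
    }
}
```

With this arrangement, one stroke of a path containing two rectangles yields a single list entry, and the individual rectangles are recovered by iterating over its subpaths.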