PDFBox - Line / Rectangle extraction

Question

I am trying to extract text coordinates and line (or rectangle) coordinates from a PDF.

The TextPosition class has getXDirAdj() and getYDirAdj() methods, which transform coordinates according to the direction of the text piece the respective TextPosition object represents (corrected based on a comment from @mkl). The final output is consistent, irrespective of the page rotation.

The output coordinates must have their origin (X0, Y0) at the top left corner of the page.

This is a slight modification of the solution by @Tilman Hausherr. The y coordinates are inverted (height - y) to keep them consistent with the coordinates from the text extraction process, and the output is written to a CSV file.

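import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class LineCatcher extends PDFGraphicsStreamEngine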
{
    private static final GeneralPath linePath = new GeneralPath();
    private static ArrayList<Rectangle2D> rectList= new ArrayList<Rectangle2D>();
    private int clipWindingRule = -1;
    private static String headerRecord = "Text|Page|x|y|width|height|space|font";

    public LineCatcher(PDPage page)
    {
        super(page);
    }

    public static void main(String[] args) throws IOException
    {
        if( args.length != 4 )
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            FileOutputStream fop = null;
            File file;
            Writer osw = null;
            int numPages;
            double page_height;
            try
            {
                document = PDDocument.load( new File(args[0], args[1]) );
                numPages = document.getNumberOfPages();
                file = new File(args[2], args[3]);
                fop = new FileOutputStream(file);

                // if the file doesn't exist, create it
                if (!file.exists()) {
                    file.createNewFile();
                }

                osw = new OutputStreamWriter(fop, "UTF8");
                osw.write(headerRecord + System.lineSeparator());
                System.out.println("Line Processing numPages:" + numPages);
                for (int n = 0; n < numPages; n++) {
                    System.out.println("Line Processing page:" + n);
                    rectList = new ArrayList<Rectangle2D>();
                    PDPage page = document.getPage(n);
                    page_height = page.getCropBox().getUpperRightY();
                    LineCatcher lineCatcher = new LineCatcher(page);
                    lineCatcher.processPage(page);

                    try{
                        for(Rectangle2D rect:rectList) {

                            String pageNum = Integer.toString(n + 1);
                            String x = Double.toString(rect.getX());
                            String y = Double.toString(page_height - rect.getY()) ;
                            String w = Double.toString(rect.getWidth());
                            String h = Double.toString(rect.getHeight());
                            writeToFile(pageNum, x, y, w, h, osw);

                        }
                        rectList = null;
                        page = null;
                        lineCatcher = null;
                    }
                    catch(IOException io){
                        // keep the original exception as the cause so the real failure is not lost
                        throw new IOException("Failed to parse document for line processing. Incorrect document format. Page:" + n, io);
                    }
                }

            }
            catch(IOException io){
                // keep the original exception as the cause so the real failure is not lost
                throw new IOException("Failed to parse document for line processing. Incorrect document format.", io);
            }
            finally
            {
                if ( osw != null ){
                    osw.close();
                }
                if( document != null )
                {
                    document.close();
                }
            }
        }
    }

    private static void writeToFile(String pageNum, String x, String y, String w, String h, Writer osw) throws IOException {
        String c = "^" + "|" +
                pageNum + "|" +
                x + "|" +
                y + "|" +
                w + "|" +
                h + "|" +
                "999" + "|" +
                "marker-only";
        osw.write(c + System.lineSeparator());
    }

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        // to ensure that the path is created in the right direction, we have to create
        // it by combining single lines instead of creating a simple rectangle
        linePath.moveTo((float) p0.getX(), (float) p0.getY());
        linePath.lineTo((float) p1.getX(), (float) p1.getY());
        linePath.lineTo((float) p2.getX(), (float) p2.getY());
        linePath.lineTo((float) p3.getX(), (float) p3.getY());

        // close the subpath instead of adding the last line so that a possible set line
        // cap style isn't taken into account at the "beginning" of the rectangle
        linePath.closePath();
    }

    @Override
    public void drawImage(PDImage pdi) throws IOException
    {
    }

    @Override
    public void clip(int windingRule) throws IOException
    {
        // the clipping path will not be updated until the succeeding painting operator is called
        clipWindingRule = windingRule;

    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        linePath.moveTo(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        linePath.lineTo(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        linePath.curveTo(x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        return linePath.getCurrentPoint();
    }

    @Override
    public void closePath() throws IOException
    {
        linePath.closePath();
    }

    @Override
    public void endPath() throws IOException
    {
        if (clipWindingRule != -1)
        {
            linePath.setWindingRule(clipWindingRule);
            getGraphicsState().intersectClippingPath(linePath);
            clipWindingRule = -1;
        }
        linePath.reset();

    }

    @Override
    public void strokePath() throws IOException
    {
        rectList.add(linePath.getBounds2D());
        linePath.reset();
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void shadingFill(COSName cosn) throws IOException
    {
    }

    /**
     * This will print the usage for this document.
     */
    private static void usage()
    {
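        // main() expects four arguments: input directory and file name, output directory and file name
        System.err.println( "Usage: java " + LineCatcher.class.getName() + " <input-dir> <input-pdf> <output-dir> <output-file>");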
    }
}

I was using the PDFGraphicsStreamEngine class to extract line and rectangle coordinates. The coordinates of the lines and rectangles do not align with the coordinates of the text.

Green: text. Red: line coordinates obtained as is. Black: expected coordinates (obtained after applying a transformation to the output).

I tried the setRotation() method to correct for the rotation before running the line extraction. However, the results are not consistent.

What are the possible options to get the rotation and get a consistent output of the Line / Rectangle coordinates using PDFBox?

See https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions and do also check whether there is a cropbox in the page. Also share some minimal working code (leave off the text part) and the PDF. — Tilman Hausherr, Mar 15 '19 at 10:42
I still need a PDF to work with. If you can't share the troublesome one, try searching for a public PDF. Alternatively, look in the source code of PDFBox, at PDFRenderer and PageDrawer. There are some affine transforms done due to rotation, cropbox and because the PDF coordinates start at the bottom. — Tilman Hausherr, Mar 16 '19 at 05:27
Indeed, the pdf is needed. Also please clearly characterize the coordinate system you want. In your question you mention *`getXDirAdj()` and `getYDirAdj()` method* but these methods transform coordinates according to the direction of the text piece the respective `TextPosition` object represents! Thus, on one hand this is not directly applicable to non-textual content like lines or rectangles (which don't have a text writing direction), and on the other hand, if there is text in different directions on a page you get text coordinates relative to different coordinate systems in your results! — mkl, Mar 16 '19 at 10:17
In case of tables sometimes column headers are drawn at a right angle, so *text in different directions on a page* is quite a realistic use case to consider... — mkl, Mar 16 '19 at 10:19
Couldn't find a public source, But created a PDF which has a similar issue with the line orientation: https://github.com/yashodhan19/PDF/blob/master/LineRotationTest.pdf — Yashodhan Joglekar, Mar 17 '19 at 04:35
Tried applying some of the rotation methods suggested by @mkl https://stackoverflow.com/questions/40611736/rotate-pdf-around-its-center-using-pdfbox-in-java The line coordinates still are not consistent. Haven't encountered cases with rotated text, But yes the use case is realistic — Yashodhan Joglekar, Mar 17 '19 at 04:45
Ok, in case of your example file there is a clockwise 90° page rotation entry and the text is drawn with a text matrix rotated 90° anti-clockwise. The origin of the PDF coordinate system is in the lower left of the unrotated page. Now please also clearly characterize the coordinate system you want. — mkl, Mar 17 '19 at 06:24
The rotation methods from my older answer are useful if you need to rotate the whole page content but you only want to find some coordinates, don't you? What actually do you need the coordinates for? That would probably imply the adequate coordinate system for you. — mkl, Mar 17 '19 at 06:32
@mkl for the final output I need the origin (X0 and Y0) to be on the top left corner of the page. My text extraction process gives me an output with the same coordinate system (top left corner origin) — Yashodhan Joglekar, Mar 17 '19 at 21:24
I have a process where I extract the characters and build words/ sentences based on custom logic. In cases where the text is too close I need the line coordinates to break the word at that point. In cases where it is enclosed by a box, I use it to build the sentence. By coordinate system - do you need the origin or is there something else? — Yashodhan Joglekar, Mar 17 '19 at 21:27
See my answer. If the resulting coordinates still differ from your expectation, please supply your expected line datasets for some lines using concrete numbers. — mkl, Mar 18 '19 at 14:38

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

As far as I understand the requirements here, the OP works in a coordinate system with the origin in the upper left corner of the visible page (taking the page rotation into account), x coordinates increasing to the right, y coordinates increasing downwards, and the units being the PDF default user space units (usually ¹/₇₂ inch).

In this coordinate system he needs to extract (horizontal or vertical) lines in the form of the coordinates of the left / top end point and the width / height.

Transforming `LineCatcher` results

The helper class LineCatcher he got from Tilman, on the other hand, does not take page rotation into account. Furthermore, it returns the bottom end point for vertical lines, not the top end point. Thus, a coordinate transformation has to be applied to the LineCatcher results.
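To see where the bottom end point comes from, here is a minimal sketch using only java.awt.geom (no PDFBox needed). The class and method names are made up for illustration. It builds a path the way LineCatcher does and inspects the bounding box: getY() is the smallest y value, which in the y-up PDF coordinate system is the lower end of a vertical line, while getMaxY() is the upper end.

```java
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

// Hypothetical demo class, not part of PDFBox or of the original LineCatcher.
class BoundsDemo {
    /** Bounding box of a straight line from (x0, y0) to (x1, y1). */
    static Rectangle2D boundsOfLine(float x0, float y0, float x1, float y1) {
        GeneralPath path = new GeneralPath();
        path.moveTo(x0, y0);
        path.lineTo(x1, y1);
        return path.getBounds2D();
    }

    public static void main(String[] args) {
        // A vertical line drawn from y = 700 down to y = 200 in PDF user space.
        Rectangle2D r = boundsOfLine(100f, 700f, 100f, 200f);
        System.out.println("getY()    = " + r.getY());    // lower end in PDF space
        System.out.println("getMaxY() = " + r.getMaxY()); // upper end in PDF space
    }
}
```

The drawing order does not matter: the bounding box always reports the minimum y as getY(), so the top end has to be computed as getY() + getHeight().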

To apply this transformation, simply replace

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x = Double.toString(rect.getX());
    String y = Double.toString(page_height - rect.getY()) ;
    String w = Double.toString(rect.getWidth());
    String h = Double.toString(rect.getHeight());
    writeToFile(pageNum, x, y, w, h, osw);
}

by

int pageRotation = page.getRotation();
PDRectangle pageCropBox = page.getCropBox();

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x, y, w, h;
    switch(pageRotation) {
    case 0:
        x = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        // top edge of the box is getY() + getHeight(); measure it down from the crop box top
        y = Double.toString(pageCropBox.getUpperRightY() - rect.getY() - rect.getHeight());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 90:
        x = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        y = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    case 180:
        x = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        y = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 270:
        // the unrotated top edge becomes the left edge; use the box's upper end getY() + getHeight()
        x = Double.toString(pageCropBox.getUpperRightY() - rect.getY() - rect.getHeight());
        y = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    default:
        // %d needs an integer; report the 1-based page number instead of the PDPage object
        throw new IOException(String.format("Unsupported page rotation %d on page %d.", pageRotation, n + 1));
    }
    writeToFile(pageNum, x, y, w, h, osw);
}

(ExtractLinesWithDir test testExtractLineRotationTestWithDir)
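As a worked check with concrete numbers, the same mapping can be sketched as a pure function that does not depend on PDFBox. The class name, method name and the llx/lly/urx/ury parameters are assumptions made for this illustration; they stand for the crop box corners. The far edges of the bounding box are taken from getMaxX() and getMaxY(), so the result is always the top left corner of the box on the displayed page.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;

// Hypothetical helper, a sketch only: maps a bounding box from unrotated PDF
// user space to {x, y, width, height} with the origin in the top left corner
// of the displayed (rotated) page.
class TopLeftTransform {
    static double[] toTopLeft(Rectangle2D r, int rotation,
                              double llx, double lly, double urx, double ury) {
        switch (rotation) {
        case 0:   // displayed as stored: flip y against the crop box top
            return new double[] { r.getMinX() - llx, ury - r.getMaxY(), r.getWidth(), r.getHeight() };
        case 90:  // unrotated bottom edge becomes the left edge, left edge becomes the top
            return new double[] { r.getMinY() - lly, r.getMinX() - llx, r.getHeight(), r.getWidth() };
        case 180: // unrotated right edge becomes the left edge, bottom edge becomes the top
            return new double[] { urx - r.getMaxX(), r.getMinY() - lly, r.getWidth(), r.getHeight() };
        case 270: // unrotated top edge becomes the left edge, right edge becomes the top
            return new double[] { ury - r.getMaxY(), urx - r.getMaxX(), r.getHeight(), r.getWidth() };
        default:
            throw new IllegalArgumentException("Unsupported page rotation " + rotation);
        }
    }

    public static void main(String[] args) {
        // Vertical line at x = 100 from y = 200 to y = 700 on a 612 x 792 crop box.
        Rectangle2D line = new Rectangle2D.Double(100, 200, 0, 500);
        for (int rotation : new int[] { 0, 90, 180, 270 }) {
            System.out.println(rotation + ": "
                    + Arrays.toString(toTopLeft(line, rotation, 0, 0, 612, 792)));
        }
    }
}
```

For rotation 0 the line starts 92 units below the top of the page (792 - 700) and is 500 units high. For rotation 90 the same line becomes a horizontal line of width 500 starting at x = 200, y = 100.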

Relation to `TextPosition.get?DirAdj()` coordinates

The OP describes the coordinates by referring to the TextPosition class methods getXDirAdj() and getYDirAdj(). Indeed, these methods return coordinates in a coordinate system with the origin in the upper left page corner and y coordinates increasing downwards after rotating the page so that the text is drawn upright.

In case of the example document all the text is drawn so that it is upright after applying the page rotation. My understanding of the requirement stated at the top was derived from this.

The problem with using the TextPosition.get?DirAdj() values as coordinates globally, though, is that in documents with pages with text drawn in different directions, the collected text coordinates suddenly are relative to different coordinate systems. Thus, a general solution should not collect coordinates wildly like that. Instead it should determine a page orientation at first (e.g. the orientation given by the page rotation or the orientation shared by most of the text) and use coordinates in the fixed coordinate system given by that orientation plus an indication of the writing direction of the text piece in question.
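One way to pick "the orientation shared by most of the text" could look like the following sketch. It assumes the writing direction of each text piece has already been collected as an integer angle (0, 90, 180 or 270), for instance from TextPosition.getDir() if your PDFBox version provides it. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, a sketch only: picks the text direction that occurs
// most often on a page. Returns 0 for an empty list; on a tie, the direction
// that reaches the highest count first wins.
class DominantDirection {
    static int dominant(List<Integer> directions) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = 0;
        int bestCount = 0;
        for (int d : directions) {
            int c = counts.merge(d, 1, Integer::sum); // increment the count for d
            if (c > bestCount) {
                best = d;
                bestCount = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Mostly upright text with a few rotated column headers.
        System.out.println(dominant(Arrays.asList(0, 0, 90, 0, 270)));
    }
}
```

All coordinates on the page would then be expressed in the coordinate system of that one direction, with each text piece's own direction stored alongside as a separate attribute.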

Thank you, this works great. Also, thank you for pointing out the issue with TextPosition.get?DirAdj(); I will make changes to the text extraction process per your suggestion. — Yashodhan Joglekar, Mar 18 '19 at 16:14
