IText reading PDF like pdftotext -layout?

Question

I'm looking for the easiest way to implement a Java solution whose output is quite similar to that of

pdftotext -layout FILE

on Linux machines. (And of course it should be cheap as well.)

I just tried some code snippets for IText, PDFBox and PDFTextStream. The most accurate solution so far is PDFTextStream, which uses the VisualOutputTarget to get a great representation of my file.

So my column layout is recognized correctly and I'm able to work with it. But there should also be a solution for IText, shouldn't there?

Every simple snippet I found produces plainly ordered strings which are a mess (they mix up rows, columns and lines). Is there a solution which is easier and does not involve writing my own strategy? Or is there an open-source strategy which I can use?

// I followed the instructions of mkl and have written my own strategy object as follows:

package com.test.pdfextractiontest.itext;

import ...


public class MyLocationTextExtractionStrategy implements TextExtractionStrategy {

    /** set to true for debugging */
    static boolean DUMP_STATE = false;

    /** a summary of all found text */
    private final List<TextChunk> locationalResult = new ArrayList<TextChunk>();


    public MyLocationTextExtractionStrategy() {
    }


    @Override
    public void beginTextBlock() {
    }


    @Override
    public void endTextBlock() {
    }

    private boolean startsWithSpace(final String str) {
        if (str.length() == 0) {
            return false;
        }
        return str.charAt(0) == ' ';
    }


    private boolean endsWithSpace(final String str) {
        if (str.length() == 0) {
            return false;
        }
        return str.charAt(str.length() - 1) == ' ';
    }

    private List<TextChunk> filterTextChunks(final List<TextChunk> textChunks, final TextChunkFilter filter) {
        if (filter == null) {
            return textChunks;
        }

        final List<TextChunk> filtered = new ArrayList<TextChunk>();
        for (final TextChunk textChunk : textChunks) {
            if (filter.accept(textChunk)) {
                filtered.add(textChunk);
            }
        }
        return filtered;
    }


    protected boolean isChunkAtWordBoundary(final TextChunk chunk, final TextChunk previousChunk) {
        final float dist = chunk.distanceFromEndOf(previousChunk);

        if (dist < -chunk.getCharSpaceWidth() || dist > chunk.getCharSpaceWidth() / 2.0f) {
            return true;
        }

        return false;
    }

    public String getResultantText(final TextChunkFilter chunkFilter) {
        if (DUMP_STATE) {
            dumpState();
        }

        final List<TextChunk> filteredTextChunks = filterTextChunks(this.locationalResult, chunkFilter);
        Collections.sort(filteredTextChunks);

        final StringBuffer sb = new StringBuffer();
        TextChunk lastChunk = null;
        for (final TextChunk chunk : filteredTextChunks) {

            if (lastChunk == null) {
                sb.append(chunk.text);
            } else {
                if (chunk.sameLine(lastChunk)) {

                    if (isChunkAtWordBoundary(chunk, lastChunk) && !startsWithSpace(chunk.text)
                            && !endsWithSpace(lastChunk.text)) {
                        sb.append(' ');
                    }
                    final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                    for(int i = 0; i<Math.round(dist); i++) {
                        sb.append(' ');
                    }
                    sb.append(chunk.text);
                } else {
                    sb.append('\n');
                    sb.append(chunk.text);
                }
            }
            lastChunk = chunk;
        }

        return sb.toString();
    }

    /** @return a String with the resulting text. */
    @Override
    public String getResultantText() {

        return getResultantText(null);

    }

    private void dumpState() {
        for (final TextChunk location : this.locationalResult) {
            location.printDiagnostics();

            System.out.println();
        }

    }


    @Override
    public void renderText(final TextRenderInfo renderInfo) {
        LineSegment segment = renderInfo.getBaseline();
        if (renderInfo.getRise() != 0) { 

            final Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
            segment = segment.transformBy(riseOffsetTransform);
        }
        final TextChunk location =
                new TextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(),
                        renderInfo.getSingleSpaceWidth(),renderInfo);
        this.locationalResult.add(location);
    }

    public static class TextChunk implements Comparable<TextChunk> {
        /** the text of the chunk */
        private final String text;
        /** the starting location of the chunk */
        private final Vector startLocation;
        /** the ending location of the chunk */
        private final Vector endLocation;
        /** unit vector in the orientation of the chunk */
        private final Vector orientationVector;
        /** the orientation as a scalar for quick sorting */
        private final int orientationMagnitude;

        private final TextRenderInfo info;

        private final int distPerpendicular;

        private final float distParallelStart;

        private final float distParallelEnd;
        /** the width of a single space character in the font of the chunk */
        private final float charSpaceWidth;

        public TextChunk(final String string, final Vector startLocation, final Vector endLocation,
                final float charSpaceWidth,final TextRenderInfo ri) {
            this.text = string;
            this.startLocation = startLocation;
            this.endLocation = endLocation;
            this.charSpaceWidth = charSpaceWidth;

            this.info = ri;

            Vector oVector = endLocation.subtract(startLocation);
            if (oVector.length() == 0) {
                oVector = new Vector(1, 0, 0);
            }
            this.orientationVector = oVector.normalize();
            this.orientationMagnitude =
                    (int) (Math.atan2(this.orientationVector.get(Vector.I2), this.orientationVector.get(Vector.I1)) * 1000);

            final Vector origin = new Vector(0, 0, 1);
            this.distPerpendicular = (int) startLocation.subtract(origin).cross(this.orientationVector).get(Vector.I3);

            this.distParallelStart = this.orientationVector.dot(startLocation);
            this.distParallelEnd = this.orientationVector.dot(endLocation);
        }

        public Vector getStartLocation() {
            return this.startLocation;
        }


        public Vector getEndLocation() {
            return this.endLocation;
        }


        public String getText() {
            return this.text;
        }

        public float getCharSpaceWidth() {
            return this.charSpaceWidth;
        }

        private void printDiagnostics() {
            System.out.println("Text (@" + this.startLocation + " -> " + this.endLocation + "): " + this.text);
            System.out.println("orientationMagnitude: " + this.orientationMagnitude);
            System.out.println("distPerpendicular: " + this.distPerpendicular);
            System.out.println("distParallel: " + this.distParallelStart);
        }


        public boolean sameLine(final TextChunk as) {
            if (this.orientationMagnitude != as.orientationMagnitude) {
                return false;
            }
            if (this.distPerpendicular != as.distPerpendicular) {
                return false;
            }
            return true;
        }


        public float distanceFromEndOf(final TextChunk other) {
            final float distance = this.distParallelStart - other.distParallelEnd;
            return distance;
        }

        public float myDistanceFromEndOf(final TextChunk other) {
            final float distance = this.distParallelStart - other.distParallelEnd;
            return distance;
        }


        @Override
        public int compareTo(final TextChunk rhs) {
            if (this == rhs) {
                return 0; // not really needed, but just in case
            }

            int rslt;
            rslt = compareInts(this.orientationMagnitude, rhs.orientationMagnitude);
            if (rslt != 0) {
                return rslt;
            }

            rslt = compareInts(this.distPerpendicular, rhs.distPerpendicular);
            if (rslt != 0) {
                return rslt;
            }

            return Float.compare(this.distParallelStart, rhs.distParallelStart);
        }

        private static int compareInts(final int int1, final int int2) {
            return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
        }


        public TextRenderInfo getInfo() {
            return this.info;
        }

    }


    @Override
    public void renderImage(final ImageRenderInfo renderInfo) {
        // do nothing
    }


    public static interface TextChunkFilter {

        public boolean accept(TextChunk textChunk);
    }


}

As you can see, most of it is the same as the original class. I just added this:

                final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                for(int i = 0; i<Math.round(dist); i++) {
                    sb.append(' ');
                }

to the getResultantText method to extend the gaps with spaces. But here is the problem:

The distance seems to be inaccurate or inexact. The result looks like this:

[screenshot]

Does anyone have an idea how to calculate a better value for the distance? I think it's because the original font is ArialMT and my editor uses Courier, but to work with this sheet I need to be able to split the table at the correct positions to get my data. That's difficult due to the floating start and end of a value, etc.

:-/

That should be fairly easy to implement by copying `LocationTextExtractionStrategy` and changing its method `getResultantText(TextChunkFilter)` a bit. Unfortunately necessary data are private in that class. Thus, deriving from it won't work (without reflection, that is). — mkl, Jul 22 '14 at 13:03
mhh i just tried the standard LocationTextExtractionStrategy and is just a mess too :-X , first of all everything is in reverse, its full of whitespaces on wrong places, but okay... its not messed up in line ordering. :D — Smoki, Jul 22 '14 at 14:06
*everything is in reverse* - that would surprise, it explicitly orders. Or do you use some RTL writing? *full of whitespaces on wrong places* - I would assume there are too few whitespaces to represent the original formatting, but too many? Can you share the PDF file in question? — mkl, Jul 22 '14 at 14:38
Unfortunately not. Some classified data ;) But i got it working normally now,... it was just an old version of IText in our Sonatype nexus. But now how get I more spaces between text chunks if they are on a same line and are indented? — Smoki, Jul 23 '14 at 10:09
As mentioned you have to edit your copy of `LocationTextExtractionStrategy`, more exactly its method `getResultantText(TextChunkFilter)`, to not only insert a single space for chunks on the same line separated by empty space; instead the number of spaces needs to fill a large enough gap. — mkl, Jul 23 '14 at 10:16
thanks mkl, i did it like you said and have now a little more work to do and didn't get it by myself till now. any new idea ? ;-) thanks in advance — Smoki, Jul 23 '14 at 12:21

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

The problem with your approach inserting spaces like this

            final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
            for(int i = 0; i<Math.round(dist); i++) {
                sb.append(' ');
            }

is that it assumes that the current position in the StringBuffer exactly corresponds to the end of lastChunk, assuming a character width of 3 user space units. This need not be the case; in general, each addition of characters destroys such a former correspondence. E.g. these two lines have very different widths when rendered in a proportional font:

ililili

MWMWMWM

while in a StringBuffer they occupy the same length.

Thus, you have to look where chunk starts in relation to the left page border and add spaces to the buffer accordingly.
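To illustrate the difference, here is a minimal, self-contained sketch that does not use iText at all. The `Chunk` class is a hypothetical stand-in for `TextChunk`, and the coordinates are invented for the demonstration. It compares padding by the gap to the previous chunk with padding by the absolute start position of the chunk.

```java
import java.util.Arrays;
import java.util.List;

class ColumnAlignmentSketch {

    /** Hypothetical stand-in for TextChunk: text plus horizontal start and end in user space units. */
    static final class Chunk {
        final String text;
        final float start;
        final float end;

        Chunk(String text, float start, float end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Gap-based padding: one space per charWidth units of gap to the previous chunk. */
    static String padByGap(List<Chunk> line, float charWidth) {
        StringBuilder sb = new StringBuilder();
        Chunk last = null;
        for (Chunk c : line) {
            if (last != null) {
                int spaces = Math.round((c.start - last.end) / charWidth);
                for (int i = 0; i < spaces; i++) sb.append(' ');
            }
            sb.append(c.text);
            last = c;
        }
        return sb.toString();
    }

    /** Position-based padding: pad until the buffer reaches the chunk's column index. */
    static String padByPosition(List<Chunk> line, float pageLeft, float charWidth) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : line) {
            int column = (int) ((c.start - pageLeft) / charWidth);
            while (sb.length() < column) sb.append(' ');
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same character count, different rendered widths; "X" starts at x = 72 on both lines.
        List<Chunk> narrow = Arrays.asList(new Chunk("ililili", 0f, 20f), new Chunk("X", 72f, 78f));
        List<Chunk> wide = Arrays.asList(new Chunk("MWMWMWM", 0f, 60f), new Chunk("X", 72f, 78f));
        System.out.println(padByGap(narrow, 6f));
        System.out.println(padByGap(wide, 6f));
        System.out.println(padByPosition(narrow, 0f, 6f));
        System.out.println(padByPosition(wide, 0f, 6f));
    }
}
```

With the gap-based variant the two `X` characters end up in different columns, while the position-based variant puts both in the same column. The sketch leaves out the mandatory single space between adjacent words.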

Furthermore your code completely ignores free space at the start of lines.

Your results should improve if you replace the original method getResultantText(TextChunkFilter) with this code:

public String getResultantText(TextChunkFilter chunkFilter){
    if (DUMP_STATE) dumpState();
    
    List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
    Collections.sort(filteredTextChunks);

    int startOfLinePosition = 0;
    StringBuffer sb = new StringBuffer();
    TextChunk lastChunk = null;
    for (TextChunk chunk : filteredTextChunks) {

        if (lastChunk == null){
            insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
            sb.append(chunk.text);
        } else {
            if (chunk.sameLine(lastChunk))
            {
                if (isChunkAtWordBoundary(chunk, lastChunk))
                {
                    insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, !startsWithSpace(chunk.text) && !endsWithSpace(lastChunk.text));
                }
                
                sb.append(chunk.text);
            } else {
                sb.append('\n');
                startOfLinePosition = sb.length();
                insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
                sb.append(chunk.text);
            }
        }
        lastChunk = chunk;
    }

    return sb.toString();       
}

void insertSpaces(StringBuffer sb, int startOfLinePosition, float chunkStart, boolean spaceRequired)
{
    int indexNow = sb.length() - startOfLinePosition;
    int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
    int spacesToInsert = indexToBe - indexNow;
    if (spacesToInsert < 1 && spaceRequired)
        spacesToInsert = 1;
    for (; spacesToInsert > 0; spacesToInsert--)
    {
        sb.append(' ');
    }
}

public float pageLeft = 0;
public float fixedCharWidth = 6;

pageLeft is the coordinate of the left page border. The strategy does not know it and, therefore, must be told explicitly; in many cases, though, 0 is the correct value.

Alternatively one could use the minimum distParallelStart value of all chunks. This would cut off the left margin but would not require you to inject the exact left page border value.
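A minimal sketch of that alternative could look as follows. It works on a plain array of start coordinates instead of real chunks, and the name `leftmostStart` is made up for this illustration. In the strategy itself one would loop over the filtered chunks and read `distParallelStart` before building the text.

```java
class LeftMarginSketch {

    /** Returns the smallest start coordinate, or 0 if there are no chunks. */
    static float leftmostStart(float[] chunkStarts) {
        if (chunkStarts.length == 0) {
            return 0f;
        }
        float min = chunkStarts[0];
        for (float start : chunkStarts) {
            min = Math.min(min, start);
        }
        return min;
    }

    public static void main(String[] args) {
        float[] starts = {72f, 56.5f, 300f};
        float pageLeft = leftmostStart(starts);
        // The leftmost chunk now maps to column 0, so the left margin is cut off.
        for (float start : starts) {
            System.out.println((int) ((start - pageLeft) / 6f));
        }
    }
}
```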

fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question a different value might be more apropos. In your case a value of 3 seems to be better than my 6.
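If guessing the value by hand is inconvenient, one possible heuristic (an assumption for illustration, not something the strategy does) is to average the rendered width per character over all chunks. The sketch below takes the chunk texts and their widths as plain arrays; in the strategy the width of a chunk would be `distParallelEnd - distParallelStart`.

```java
class CharWidthSketch {

    /** Average width per character: total chunk width divided by total character count. */
    static float averageCharWidth(String[] texts, float[] widths) {
        float totalWidth = 0f;
        int totalChars = 0;
        for (int i = 0; i < texts.length; i++) {
            totalWidth += widths[i];
            totalChars += texts[i].length();
        }
        // Fall back to the default of 6 when there is no text to measure.
        return totalChars == 0 ? 6f : totalWidth / totalChars;
    }

    public static void main(String[] args) {
        String[] texts = {"abcd", "ab"};
        float[] widths = {24f, 12f};
        System.out.println(averageCharWidth(texts, widths));
    }
}
```

An average underestimates lines made of wide glyphs, so such lines can still overflow into the next column. A somewhat smaller value than the average may therefore work better for table-like documents.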

There still is a lot of room for improvement in this code. E.g.

- It assumes that there are no text chunks spanning multiple table columns. This assumption very often is correct, but I have seen weird PDFs in which the normal inter-word spacing has been implemented using separate text chunks at some offset but the inter-column spacing was represented by a single space character in a single chunk (spanning the end of one column and the start of the next)! The width of that space character has been manipulated by the word-spacing setting of the PDF graphics state.
- It ignores different amounts of vertical space.
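For the vertical space, a rough sketch under the same kind of assumption as fixedCharWidth would be to assume a fixed line height and emit extra blank lines for larger gaps between baselines. The name `blankLinesBetween` and the line height parameter are hypothetical; in the strategy the baseline positions would come from `distPerpendicular` of consecutive lines.

```java
class VerticalGapSketch {

    /** Number of blank lines to insert between two baselines, given an assumed line height. */
    static int blankLinesBetween(float previousBaseline, float baseline, float lineHeight) {
        float gap = Math.abs(previousBaseline - baseline);
        int lines = Math.round(gap / lineHeight);
        // One line of distance is the normal line break, so only the excess becomes blank lines.
        return Math.max(0, lines - 1);
    }

    public static void main(String[] args) {
        System.out.println(blankLinesBetween(700f, 688f, 12f));
        System.out.println(blankLinesBetween(700f, 664f, 12f));
    }
}
```

In getResultantText one would append that many additional '\n' characters before resetting startOfLinePosition for the new line.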
