I'm trying to extract text from a PDF that is full of tables. In some cases, a column is empty. When I extract the text from the PDF, the empty columns are skipped and replaced by a single whitespace, so my regular expressions cannot tell that there was a column with no information at that spot.

An image for better understanding:

[Image: PDF source and extracted text]

We can see that the columns are not preserved in the extracted text.

Sample of my code that extracts the text from the PDF:

PDFTextStripper reader = new PDFTextStripper();
reader.setSortByPosition(true);
reader.setStartPage(page);
reader.setEndPage(page);
String st = reader.getText(document);
List<String> lines = Arrays.asList(st.split(System.getProperty("line.separator")));

How can I maintain the full structure of the original PDF when extracting text from it?

Thanks a lot.

Leor
  • Try a tool like tabula-java, which is built on top of PDFBox. PDFBox doesn't try to identify tables. – Tilman Hausherr Aug 23 '17 at 13:33
  • Leor, if a variant of the `PDFTextStripper` is of interest to you which attempts to insert extra spaces where in the PDF there is a big gap, I'll copy [the answer I once gave to a meanwhile deleted question](https://stackoverflow.com/a/28370692/1729265) with just such a variant. – mkl Aug 23 '17 at 13:50
  • @mkl Your solution might be helpful. If the extra spaces added are always the same (in terms of number of characters), it can do the job. – Leor Aug 23 '17 at 14:04

1 Answer

(This originally was the answer (dated Feb 6 '15) to another question which the OP deleted including all answers. Due to its age, the code in the answer is still based on PDFBox 1.8.x, so some changes might be necessary to make it run with PDFBox 2.0.x.)

In comments, the OP showed interest in a solution that extends the PDFBox PDFTextStripper to return text lines that attempt to reflect the PDF file layout, which might help with the question at hand.

A proof-of-concept for that would be this class:

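import java.io.IOException;
import java.lang.reflect.Field;
import java.util.List;

// PDFBox 1.8.x package names
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class LayoutTextStripper extends PDFTextStripper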
{
    public LayoutTextStripper() throws IOException
    {
        super();
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        super.startPage(page);
        cropBox = page.findCropBox();
        pageLeft = cropBox.getLowerLeftX();
        beginLine();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        float recentEnd = 0;
        for (TextPosition textPosition: textPositions)
        {
            String textHere = textPosition.getCharacter();
            if (textHere.trim().length() == 0)
                continue;

            float start = textPosition.getTextPos().getXPosition();
            boolean spacePresent = endsWithWS | textHere.startsWith(" ");

            if (needsWS | spacePresent | Math.abs(start - recentEnd) > 1)
            {
                int spacesToInsert = insertSpaces(chars, start, needsWS & !spacePresent);

                for (; spacesToInsert > 0; spacesToInsert--)
                {
                    writeString(" ");
                    chars++;
                }
            }

            writeString(textHere);
            chars += textHere.length();

            needsWS = false;
            endsWithWS = textHere.endsWith(" ");
            try
            {
                recentEnd = getEndX(textPosition);
            }
            catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e)
            {
                throw new IOException("Failure retrieving endX of TextPosition", e);
            }
        }
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();
        beginLine();
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        needsWS = true;
    }

    void beginLine()
    {
        endsWithWS = true;
        needsWS = false;
        chars = 0;
    }

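    // Computes how many spaces to write so that the next chunk starts at the
    // character index matching its horizontal offset from the left page edge.
    int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired)
    {
        int indexNow = charsInLineAlready;
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        // Keep at least one space where a word separator was requested
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;

        return spacesToInsert;
    }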

    // The endX field is not publicly accessible here, so it is read via reflection.
    float getEndX(TextPosition textPosition) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException
    {
        Field field = textPosition.getClass().getDeclaredField("endX");
        field.setAccessible(true);
        return field.getFloat(textPosition);
    }

    public float fixedCharWidth = 3;

    boolean endsWithWS = true;
    boolean needsWS = false;
    int chars = 0;

    PDRectangle cropBox = null;
    float pageLeft = 0;
}

It is used like this:

PDDocument document = PDDocument.load(PDF);

LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5

String text = stripper.getText(document);

fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question a different value might be more apropos. In my sample documents values from 3..6 were of interest.
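To illustrate how the choice of fixedCharWidth affects the output, here is a minimal standalone sketch that uses the same index arithmetic as insertSpaces above, but without PDFBox. The class name ColumnIndexSketch, the layOut helper, and the sample coordinates are made up for illustration, and the spacing rule is simplified (it always keeps at least one space between chunks).

```java
final class ColumnIndexSketch {
    // Same arithmetic as insertSpaces in the answer: x offset from the left edge -> character index.
    static int columnIndex(float chunkStart, float pageLeft, float fixedCharWidth) {
        return (int) ((chunkStart - pageLeft) / fixedCharWidth);
    }

    // Lays out one line of chunks; xs[i] is the (hypothetical) start x of texts[i].
    static String layOut(float[] xs, String[] texts, float pageLeft, float fixedCharWidth) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < xs.length; i++) {
            int target = columnIndex(xs[i], pageLeft, fixedCharWidth);
            // Simplification: never let two chunks touch, even if the index is already passed
            if (line.length() > 0 && target <= line.length()) {
                target = line.length() + 1;
            }
            while (line.length() < target) {
                line.append(' ');
            }
            line.append(texts[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        float[] xs = {0f, 120f, 300f};          // made-up x coordinates in PDF units
        String[] texts = {"Name", "42", "Paris"};
        System.out.println("[" + layOut(xs, texts, 0f, 3f) + "]");
        System.out.println("[" + layOut(xs, texts, 0f, 6f) + "]");
    }
}
```

With these sample values, "42" starts at character index 40 for a width of 3 and at index 20 for a width of 6. A smaller width spreads the text over more characters, which makes it less likely that neighbouring chunks collide.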

This class essentially emulates the analogous solution for iText in this answer. Results differ a bit, though, because iText text extraction forwards text chunks whereas PDFBox text extraction forwards individual characters.

Please be aware that this is merely a proof-of-concept. In particular, it does not take any rotation into account.
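Once the extracted lines keep their horizontal layout, an empty table column shows up as a run of spaces at a predictable position. One possible way to read such lines, sketched below with plain Java, is to cut each line at fixed character offsets instead of relying on a regular expression. The class name FixedWidthColumns and the offsets 0, 10 and 20 are hypothetical; real offsets would have to be derived from the document and the chosen fixedCharWidth.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedWidthColumns {
    // Cuts a layout-preserved line at the given start offsets.
    // A column that contains only spaces is returned as an empty string.
    static List<String> split(String line, int[] columnStarts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < columnStarts.length; i++) {
            int start = Math.min(columnStarts[i], line.length());
            int end = (i + 1 < columnStarts.length)
                    ? Math.min(columnStarts[i + 1], line.length())
                    : line.length();
            cells.add(line.substring(start, end).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        int[] columnStarts = {0, 10, 20}; // hypothetical column offsets
        // Sample lines built with padding, standing in for stripper output
        String full = String.format("%-10s%-10s%s", "Smith", "42", "Paris");
        String gap = String.format("%-10s%-10s%s", "Jones", "", "Lyon");
        System.out.println(split(full, columnStarts));
        System.out.println(split(gap, columnStarts));
    }
}
```

In the second sample line the middle column is blank, and the sketch returns an empty string for it rather than silently shifting "Lyon" into its place.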

mkl
  • Your solution works pretty well. It needed to be adapted a bit to match the PDFBox version I was using, but the first run is promising. The structure is nearly identical to the original PDF. I will use this solution unless a better one turns up. Thanks a lot. – Leor Aug 24 '17 at 07:53
  • This solution of using `LayoutTextStripper` is useful for my application. But sometimes I get text like "Name and address of the Person" as "Nameandaddressof the Person", with some of the single spaces missing between words. I am using PDFBox 2.0.13. What can I do to get it right? (I am using PDFBox for the first time, and the changes I made to run the code with version 2 might be causing this.) Thanks for any suggestions. – prasad_ Dec 10 '18 at 03:45
  • Okay, I found a working version of the [PDFLayoutTextStripper](https://github.com/JonathanLink/PDFLayoutTextStripper) for PDFBox 2.x. – prasad_ Dec 10 '18 at 04:42
  • As mentioned in the answer, it presented a *proof-of-concept*, so some details probably still were rough. I would assume that changing (lowering) the `fixedCharWidth` value might help with the code above. – mkl Dec 10 '18 at 05:59