PDFDomTree not detecting white spaces while converting a pdf file to html

Question

I am using PDFDomTree with pdfbox-2.0.9 in my Java application to convert a PDF file to an HTML file. I used the following code to convert the PDF.

// try-with-resources closes the writer and the document even if the conversion throws
try (PDDocument document = PDDocument.load(new File("some path"));
     Writer output = new PrintWriter(new File("some output path"), "utf-8")) {
    PDFDomTree parser = new PDFDomTree(PDFDomTreeConfig.createDefaultConfig());

    // converts the PDF to HTML and writes it to the output
    parser.writeText(document, output);
}

My issue is that when I analysed the output HTML, I realised that the converter did not detect the whitespace between some words, so those words ended up concatenated.

Check the comparison below:

Corresponding pdf file can be accessed from here if needed.

Can anyone please help me with this?

I'm not super familiar with this library, but have you traced the white space error to its point of origin? For example, if the issue occurs when you create the **document** variable, others will have hit the same issue and you can look that up. If the white space issue is not present in the **PDDocument document** variable, simply replace the white space with a random set of characters, and set it back to white space explicitly when you call **parser.writeText(document, output);** — ViaTech, Aug 03 '18 at 13:36
please share the PDF document. And update to the latest version (won't solve this problem, but it is always good to use the latest version). — Tilman Hausherr, Aug 03 '18 at 16:05
I would assume that there are no white spaces in the source file, merely gaps, but these gaps are too small to be recognized as interword space. (If you look at your screenshots, the missing spaces are at very small gaps.) If you share the PDF, we can check whether this assumption holds. In that case it might be possible to tweak the text extraction to produce the white spaces. — mkl, Aug 03 '18 at 16:22
@TilmanHausherr I have added link to pdf file in question. Please check. — vsbehere, Aug 03 '18 at 19:46
@mkl I have checked for this case also. These are actually white spaces and not gaps. When I copied text from pdf and pasted it in simple text editor, text appeared with white spaces appropriately. — vsbehere, Aug 03 '18 at 19:52
PDFDomTree is not from PDFBox, this is from a third party project. The PDFBox ExtractText utility works fine. (And it can also convert to HTML) — Tilman Hausherr, Aug 03 '18 at 20:14
vsbehere - my assumption was based on another assumption: that you used pdfbox' text extraction classes. As @Tilman pointed out, this is not the case. I made that fact clearer both in your question title and your question text. — mkl, Aug 04 '18 at 05:17
vsbehere - *"These are actually white spaces and not gaps."* - you are right, furthermore "spetto al mese precedente, per effetto" (including the spaces) actually is drawn by a single instruction in the PDF. Dropping the spaces in such a case shows a surprising deficit in the text extractor. As @TilmanHausherr already indicated, PDFBox' text extraction works fine here. — mkl, Aug 06 '18 at 10:01
I just had a look at the actually extracted data. It looks like pdf2dom actually considers the whole line a single word because the spaces are too small. Ah, I just found `if (!text.getUnicode().trim().isEmpty())` in its `processTextPosition` override. This will drop any space character because the characters come one-by-one. Thus, these spaces are not taken into account later anymore. — mkl, Aug 06 '18 at 11:15
Yes, I realised that. I am working to find a way around or a solution for it. Meanwhile if you find or suggest anything, that would be helpful. Thanks. — vsbehere, Aug 06 '18 at 11:18
The very least you should do is to report the issue https://github.com/radkovo/Pdf2Dom/issues . — Tilman Hausherr, Aug 06 '18 at 11:50
@TilmanHausherr can you please guide me on how to use PDFBox ExtractText utility to convert a pdf to html, and will it persist all styling information also? — vsbehere, Aug 07 '18 at 13:02
See https://pdfbox.apache.org/2.0/commandline.html , the command is java -jar pdfbox-app-2.0.11.jar ExtractText -html file.pdf . It will keep some styling information but not all. — Tilman Hausherr, Aug 07 '18 at 13:05
@TilmanHausherr I came across another pdf recently in which extractText utility is failing and words are getting merged. The word 'che nel' in 3rd row is getting combined. You can access pdf at https://drive.google.com/open?id=1Mokm8HbiS22YMhVOXTae9IjNfBvCJqdI — vsbehere, May 17 '19 at 05:07
This is probably a difficult decision… the gaps are so tiny that PDFBox makes the wrong decision whether the gap is a space or not. You could try to play with the `spacingTolerance` and the `averageCharTolerance` of the stripper. It's not available in the command line utility but if you want it, I'll add it if you create an issue in PDFBox JIRA. — Tilman Hausherr, May 18 '19 at 11:08

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

The text extractor at hand, Pdf2Dom's PDFDomTree, is based on PDFBox' PDFTextStripper but only uses it to parse the PDF drawing instructions into characters with style and position while it does all the analysis of these rich characters itself.

In particular it ignores all incoming white space characters in its PDFBoxTree parent class:

protected void processTextPosition(TextPosition text)
{
    if (text.isDiacritic())
    {
        lastDia = text;
    }
    else if (!text.getUnicode().trim().isEmpty())
    {
        [...process character...]
    }
}

(org.fit.pdfdom.PDFBoxTree override processTextPosition)

In that [...process character...] block it tries to recognize word gaps by hard coded distances:

        //should we split the boxes?
        boolean split = lastText == null || distx > 1.0f || distx < -6.0f || Math.abs(disty) > 1.0f
                            || isReversed(getTextDirectionality(text)) != isReversed(getTextDirectionality(lastText));

(inside the [...process character...] block above)

As the text in your PDF is small to start with (9pt determined by Pdf2Dom) and in many lines very tightly set, gaps between words usually are smaller than the 1.0 assumed above (distx > 1.0f).

In my eyes there are two issues here:

Dropping white spaces means throwing away information. (In some situations this might be advantageous: I've seen PDFs with the same line drawn twice, where one drawing string argument contains spaces wherever the other contains visible characters. But these are exceptions.)
Having hard-coded distance limits (distx > 1.0f, distx < -6.0f, etc.) even though the font sizes, and with them the gap sizes, can vary a lot.

These issues should be fixed in the code. Two possible work-arounds for PDFs like your demo.pdf:

Choosing different distance limits

A true fix should try and make the distance limits dynamic, depending on the font size and probably even the average character distance in the current line up to the current position. A work-around for your PDF would be to replace the hard-coded distance by a smaller hard-coded one.

E.g. using .5f instead of the 1.0f as word distance, i.e. replacing the test above by

        //should we split the boxes?
        boolean split = lastText == null || distx > .5f || distx < -6.0f || Math.abs(disty) > 1.0f
                            || isReversed(getTextDirectionality(text)) != isReversed(getTextDirectionality(lastText));

This results in Pdf2Dom recognizing the word gaps in your document (or at least many more, I have not checked all of them).
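
As a rough illustration of what a dynamic limit could look like, here is a minimal sketch that scales the word-gap threshold with the font size instead of hard-coding it. The class RelativeWordGap, its methods, and the ratio 0.05 are hypothetical and not part of Pdf2Dom. The sketch also assumes that distx and the font size are expressed in the same units, which I have not verified against the Pdf2Dom code.

```java
/** Hypothetical helper, not part of Pdf2Dom: word-gap test relative to the font size. */
final class RelativeWordGap {
    // Assumed ratio, chosen for illustration only: a horizontal gap wider than
    // this fraction of the font size is treated as a word gap.
    // At 9pt this gives a limit of about 0.45, at 20pt about 1.0.
    static final float GAP_TO_FONT_SIZE = 0.05f;

    /** Returns the gap limit for the given font size. */
    static float threshold(float fontSize) {
        return GAP_TO_FONT_SIZE * fontSize;
    }

    /** Returns true if the horizontal distance to the previous character looks like a word gap. */
    static boolean isWordGap(float distx, float fontSize) {
        return distx > threshold(fontSize);
    }
}
```

In the split test, such a check would take the place of the distx > 1.0f comparison. A more robust variant would probably also take the average character distance in the current line into account, as mentioned above.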

Interpreting white spaces as splits

Instead of ignoring white spaces, you can explicitly interpret them as word gaps, e.g. by enhancing the processTextPosition override like this

protected void processTextPosition(TextPosition text)
{
    if (text.isDiacritic())
    {
        lastDia = text;
    }
    else if (!text.getUnicode().trim().isEmpty())
    {
        [...process character...]
    } else {
//!! process white spaces here
        //finish current box (if any)
        if (lastText != null)
        {
            finishBox();
        }
        //start a new box
        curstyle = new BoxStyle(style);
        lastText = null;
    }
}

I have not analyzed the code in depth, so I can only call this a work-around. To make it a real fix, you have to test it for side effects and also extend it to look into the exact nature of the white space: There are other white space characters than the normal space, some of them zero-width, some non-breaking, etc. All these different types of white space deserve special treatment.
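
To make the point about different kinds of white space more concrete, here is a minimal sketch of a classifier using only the Java standard library. The class WhitespaceKinds and its grouping into three kinds are my own assumptions for illustration, not something Pdf2Dom or PDFBox provides. Note that String.trim() only strips characters up to U+0020, so a no-break space (U+00A0) passes the trim().isEmpty() test as if it were a visible character.

```java
/** Hypothetical helper, not part of Pdf2Dom or PDFBox: rough classification of white space. */
final class WhitespaceKinds {
    enum Kind { NONE, ZERO_WIDTH, NON_BREAKING, ORDINARY }

    /** Classifies a single code point; the grouping is an illustrative assumption. */
    static Kind classify(int codePoint) {
        switch (codePoint) {
            case 0x200B: // ZERO WIDTH SPACE
            case 0xFEFF: // ZERO WIDTH NO-BREAK SPACE
                return Kind.ZERO_WIDTH;
            case 0x00A0: // NO-BREAK SPACE
            case 0x2007: // FIGURE SPACE
            case 0x202F: // NARROW NO-BREAK SPACE
                return Kind.NON_BREAKING;
            default:
                // isWhitespace covers space, tab and line breaks;
                // isSpaceChar covers the remaining Unicode space separators
                return (Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint))
                        ? Kind.ORDINARY : Kind.NONE;
        }
    }

    /** True if every code point is some kind of white space (also true for an empty string). */
    static boolean isWhitespaceOnly(String s) {
        return s.codePoints().allMatch(cp -> classify(cp) != Kind.NONE);
    }
}
```

A processTextPosition override could then decide per kind what to do, for example finishing the current box only for ORDINARY and NON_BREAKING and ignoring ZERO_WIDTH. Whether that is the right treatment for a given PDF would still need testing.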

PS: As many PDFBoxTree members are protected (and not private), it is easily possible to apply the second work-around without having to patch Pdf2Dom:

PDDocument document = PDDocument.load(SOURCE);

PDFDomTree parser = new PDFDomTree(PDFDomTreeConfig.createDefaultConfig()) {
    @Override
    protected void processTextPosition(TextPosition text) {
        if (text.getUnicode().trim().isEmpty()) {
            //finish current box (if any)
            if (lastText != null)
            {
                finishBox();
            }
            //start a new box
            curstyle = new BoxStyle(style);
            lastText = null;
        } else {
            super.processTextPosition(text);
        }
    }
};
Writer output = new PrintWriter(TARGET, "utf-8");

parser.writeText(document, output);
output.close();
document.close();

(ExtractText test testDemoImproved)
