How to avoid spurious spaces in words when reading PDF using iText for .NET

Question

Using iText7 (v8.0.0) I am attempting to parse a (non-PDF/A) PDF. The code is as follows:

using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

var pdfDocument = new PdfDocument(new PdfReader("TestFile.pdf"));

var page = pdfDocument.GetPage(1);

var text = PdfTextExtractor.GetTextFromPage(page);

All is well in terms of the output except that certain words in the output text have spurious spaces in them. I understand that this is to do with the way text is rendered into the PDF. The GetTextFromPage method has an overload that takes a text extraction strategy; I tried both the default strategy implementations LocationTextExtractionStrategy and SimpleTextExtractionStrategy but neither dealt with the issue.

I am guessing that I need to define my own text extraction strategy but it isn't very obvious how to go about doing this.

(In case readers are interested, I tried the same with IronPDF and that was no better.)

Hard to reply without an example PDF, but I recently wrote an answer that involved custom text extraction: https://stackoverflow.com/a/76620874/1566339 (you can see the interface you need to implement from it). (I'm on mobile, sorry for the lazy temporary reply) — André Lemos, Jul 17 '23 at 11:38
Many thanks for the above comments. I have subsequently been in touch with iText sales and their pricing is off the charts for commercial use, so I am abandoning this approach. A big shame as it seems that reading PDFs is something it is impossible to do with a decent open source implementation. (iText's open source is copy-left, so not suitable for my application.) — paytools-steve, Jul 17 '23 at 13:44
Often the reason such spaces appear in text extraction is that one PDF mechanism for separating words can also be used to achieve kerning, and text extractors have to apply heuristics to determine which is which. Also, there are documents in which a space and a glyph are printed at (nearly) the same position, while in the output of text extractors they follow one another. — mkl, Jul 17 '23 at 15:02
If you're dropping the idea of using iText to do text extraction, maybe have a look at PdfPig and its wiki (https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis). It's a free, open source (Apache 2) project. Disclaimer: I'm a contributor to the project — bld, Jul 23 '23 at 15:24
Thank you for the pointer @bld (and apologies for the delay in responding). I did try PdfPig but it was giving me back each letter as a separate word. I will give it another go when I come back to this, although it wasn't clear that the project was being actively maintained - hopefully it is. — paytools-steve, Aug 23 '23 at 07:58

score 1 · Answer 1 · answered Jul 18 '23 at 09:59

Although I have since decided not to use the iText library because of the licensing costs, I did manage to fix the issue, so I thought I'd share my findings. Aside from the many helpful comments above, I got the base information I needed from the answers to this question: "how can we extract text from pdf using itextsharp with spaces?"

Here is the easiest way to address this issue. Add the following class:

using iText.Kernel.Pdf.Canvas.Parser.Listener;

internal class MyStrategy : LocationTextExtractionStrategy
{
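    // Decides whether a space is inserted between two consecutive text chunks.
    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        var chunkLocation = chunk.GetLocation();
        var previousChunkLocation = previousChunk.GetLocation();
        var chunkCharSpaceWidth = chunkLocation.GetCharSpaceWidth();

        // Gap between the end of the previous chunk and the start of this one.
        float dist = chunkLocation.DistanceFromEndOf(previousChunkLocation);

        // Word boundary if this chunk starts more than one space width before the
        // previous chunk ended, or if the gap exceeds the space width divided by 1.5.
        return dist < -chunkCharSpaceWidth || dist > chunkCharSpaceWidth / 1.5f;
    }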
}

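I found that a divisor of 1.5f gave the best results (the default is 2.0f); your mileage may vary. A smaller divisor raises the gap that two chunks must exceed before a space is inserted between them, so fewer spurious spaces appear.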
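If you want to experiment with other values, a variant along the following lines may help. It is only a sketch: TunableStrategy and its IsWordBoundary helper are names I made up for illustration, not part of iText, and the comparison is simply the one from MyStrategy with the divisor passed in.

```csharp
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical variant of MyStrategy with the divisor exposed as a constructor argument.
internal class TunableStrategy : LocationTextExtractionStrategy
{
    private readonly float _divisor;

    public TunableStrategy(float divisor = 1.5f)
    {
        _divisor = divisor;
    }

    // Same comparison as in MyStrategy, pulled out as a pure function
    // so it can be tried with plain numbers, without a PDF.
    internal static bool IsWordBoundary(float dist, float charSpaceWidth, float divisor)
    {
        return dist < -charSpaceWidth || dist > charSpaceWidth / divisor;
    }

    protected override bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk)
    {
        var location = chunk.GetLocation();
        float dist = location.DistanceFromEndOf(previousChunk.GetLocation());
        return IsWordBoundary(dist, location.GetCharSpaceWidth(), _divisor);
    }
}
```

As a worked example, take a space width of 3.0 and a gap of 1.8 between two chunks. With the divisor 2.0 the threshold is 1.5, so IsWordBoundary(1.8f, 3.0f, 2.0f) is true and a space is inserted. With the divisor 1.5 the threshold is 2.0, so IsWordBoundary(1.8f, 3.0f, 1.5f) is false and the two chunks are joined.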

It is then a simple matter to supply the custom strategy to the processing thus:

var text = PdfTextExtractor.GetTextFromPage(page, new MyStrategy());
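For a whole document, something like the sketch below should work. It assumes the MyStrategy class above is in the same project; PageTextReader and ReadAllPages are made-up names for illustration. As far as I can tell, LocationTextExtractionStrategy keeps collecting chunks for as long as the instance lives, so the sketch creates a new strategy for each page instead of reusing one.

```csharp
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// Hypothetical helper; relies on the MyStrategy class defined above.
internal static class PageTextReader
{
    public static List<string> ReadAllPages(string path)
    {
        var pages = new List<string>();

        // Dispose the document when done so the underlying file is released.
        using (var pdfDocument = new PdfDocument(new PdfReader(path)))
        {
            // iText page numbers start at 1.
            for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
            {
                // A fresh strategy per page, so text from earlier pages is not repeated.
                var strategy = new MyStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
            }
        }

        return pages;
    }
}
```

Calling PageTextReader.ReadAllPages("TestFile.pdf") then returns one string per page, in page order.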

Note that if you need to do the same thing with the SimpleTextExtractionStrategy you pretty much have to copy and paste the entire original class, as it uses a bunch of private members which you don't have access to when inheriting.
