Reading text from PDF in .NET

Question

I am trying to read text from a PDF into a string using the iTextSharp library.

using System;
using System.Text;
using iTextSharp.text.pdf.parser;

iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(@"C:\mypdf.pdf");
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
string text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
pdfReader.Close();
Console.WriteLine(text);

This normally works OK, but every few lines the whitespace will be omitted, leaving me with output like: "thisismyoutputwithoutwhitespace". The text that parses correctly seems to be the same as the text that doesn't; the same text will consistently be parsed incorrectly, which makes me think it's something within the PDFs.

score 7 · Accepted Answer · answered Dec 11 '13 at 16:10

A PDF content stream has no notion of "words", so iText(Sharp)'s text extraction implementation uses heuristics to decide how to group characters into words. When the distance between two characters is larger than half the width of a space in the current font, whitespace is inserted.

Most likely, the text that gets extracted without whitespace has distances between the words that are smaller than "spacewidth / 2".

In SimpleTextExtractionStrategy.RenderText():

if (spacing > renderInfo.GetSingleSpaceWidth()/2f){
    AppendTextChunk(' ');
}
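
To make the heuristic concrete, here is a minimal, library-free sketch of the same idea. It is not iTextSharp code: `GlyphRun`, `SpacingHeuristicSketch`, `Join` and the `factor` parameter are hypothetical names introduced for illustration, and the model only looks at horizontal positions on a single line.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a piece of rendered text: the string plus the
// x-coordinates where it starts and ends on the baseline.
public sealed class GlyphRun
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }

    public GlyphRun(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class SpacingHeuristicSketch
{
    // Concatenates the runs in order and inserts a space whenever the gap to
    // the previous run is larger than spaceWidth * factor.
    // factor = 0.5f corresponds to the "spacewidth / 2" rule (assumed model).
    public static string Join(IEnumerable<GlyphRun> runs, float spaceWidth, float factor)
    {
        var result = new StringBuilder();
        bool first = true;
        float lastEnd = 0f;

        foreach (GlyphRun run in runs)
        {
            if (!first && run.StartX - lastEnd > spaceWidth * factor)
                result.Append(' ');

            result.Append(run.Text);
            lastEnd = run.EndX;
            first = false;
        }
        return result.ToString();
    }
}
```

With a space width of 5, runs "this" (0 to 20), "is" (21 to 29) and "my" (32 to 42) join to "thisis my" at a factor of 0.5, because the gap of 1 between the first two runs is below the threshold of 2.5. Lowering the factor to 0.15 gives "this is my".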

You can extend SimpleTextExtractionStrategy and adjust its RenderText() method.

In LocationTextExtractionStrategy it is more convenient. You only need to override IsChunkAtWordBoundary():

protected bool IsChunkAtWordBoundary(TextChunk chunk, TextChunk previousChunk) {
    float dist = chunk.DistanceFromEndOf(previousChunk);
    if (dist < -chunk.CharSpaceWidth || dist > chunk.CharSpaceWidth / 2.0f)
        return true;

    return false;
}

You'll have to experiment a bit to get good results for your PDFs. "spacewidth / 2" is apparently too large in your case. But if you adjust it to be too small, you'll get false positives: whitespace will be inserted within words.
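
One way to structure that experimentation is to collect the gaps between consecutive chunks on a line whose correct text you know, then compare how many spaces each candidate threshold would insert. The sketch below is a library-free illustration under that assumption; `ThresholdSweepSketch`, `CountInsertedSpaces` and `Sweep` are hypothetical helpers, not part of iTextSharp, and the gap values would have to come from your own instrumentation of the extraction strategy.

```csharp
using System;
using System.Collections.Generic;

public static class ThresholdSweepSketch
{
    // Returns how many of the measured gaps would be turned into spaces
    // when the threshold is spaceWidth * factor.
    public static int CountInsertedSpaces(IEnumerable<float> gaps, float spaceWidth, float factor)
    {
        float threshold = spaceWidth * factor;
        int count = 0;
        foreach (float gap in gaps)
        {
            if (gap > threshold)
                count++;
        }
        return count;
    }

    // Prints the count for each candidate factor so it can be compared with
    // the number of spaces the line is known to contain.
    public static void Sweep(IList<float> gaps, float spaceWidth, IEnumerable<float> factors)
    {
        foreach (float factor in factors)
        {
            Console.WriteLine("factor {0}: {1} space(s)",
                factor, CountInsertedSpaces(gaps, spaceWidth, factor));
        }
    }
}
```

For example, with a space width of 5 and measured gaps of 0.2, 1.0, 3.0 and 0.3, where only 1.0 and 3.0 are real word gaps, a factor of 0.5 counts one space (too few), 0.15 counts two, and 0.05 counts three, because the 0.3 gap inside a word now passes the threshold. That last case is the false positive described above.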

rhens

Thank you very much! This is very helpful. However, are you sure IsChunkAtWordBoundary() is overridable? I'm getting a "cannot override because it is not marked as abstract, virtual.." error. I made a new class that extends LocationTextExtractionStrategy and overrode the method. – John 'Mark' Smith Dec 11 '13 at 16:45
This seems to be a porting error, from Java to C#. I'll make sure this is fixed in the next release. As a workaround, I think you'll have to copy the LocationTextExtractionStrategy code, effectively creating a completely new implementation of the ITextExtractionStrategy interface. In your new implementation you can adjust the isChunkAtWordBoundary method. I know... not the cleanest solution. I'm not too familiar with C#; maybe someone with more C# experience can think of a more elegant solution. – rhens Dec 11 '13 at 17:07
If you don't have the source code of LocationTextExtractionStrategy available, you can find it here (most current version): http://sourceforge.net/p/itextsharp/code/HEAD/tree/trunk/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs – rhens Dec 11 '13 at 17:08
Also read [this answer](http://stackoverflow.com/questions/16398483/how-can-we-extract-text-from-pdf-using-itextsharp-with-spaces/20049810#20049810) – mkl Dec 11 '13 at 17:24
