PDF text extraction via iText returns strange characters

Question

I am using iText 5.3.4 to extract text from a PDF file, with the code below:

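    import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
    import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

    // pdfReader is an already opened com.itextpdf.text.pdf.PdfReader
    PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
    TextExtractionStrategy strategy;
    StringBuilder sb = new StringBuilder();

    // iText page numbers are 1-based; use a fresh strategy for every page
    for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
    {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        sb.append(strategy.getResultantText());
    }
    String text = sb.toString();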

For one particular PDF, however, an ë is returned as °. Any idea why this might happen and what can be done about it? Is it a bug in the iText library, or was there an error in the construction of the PDF?

Thanks for the assistance.

The very first thing to test - don't worry, it's an easy one - is to copy the text with Acrobat Reader and paste it elsewhere. If Acrobat Reader cannot read the text faithfully, chances are high that the problem lies in the PDF. — Jongware, Oct 07 '15 at 17:36
And yet another thing to do: please update. The 5.3.x versions were a time of changes in text extraction code. — mkl, Oct 07 '15 at 18:39
See http://stackoverflow.com/a/32929474/1520650 for a similar issue and a possible explanation for this behavior. — rhens, Oct 07 '15 at 20:52

score 4 · Accepted Answer · answered Oct 07 '15 at 13:47

I see two possible causes:

1. The PDF document is the problem

Some banks create documents with confidential information. To prevent those documents from being parsed and their content from being extracted, they deliberately create a CMap with incorrect information. A character is linked to a glyph (and the glyph is rendered correctly), but there is also a mapping of the character to a Unicode symbol, and that mapping is deliberately wrong (so that the content can't be extracted).
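
To make the mechanism concrete, here is a minimal sketch in plain Java. It does not use iText and is not how iText is implemented; the class `ToUnicodeDemo`, the method `extract`, and the glyph codes 1 to 4 are all hypothetical. It only models the idea that text extraction relies on a code-to-Unicode table that is separate from the glyph drawing, so a wrong entry in that table changes the extracted text while the page still looks correct.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a ToUnicode-style lookup (not iText code).
class ToUnicodeDemo {

    // Translate the glyph codes found in the page content into text
    // by looking each one up in the code-to-Unicode table.
    static String extract(int[] glyphCodes, Map<Integer, Character> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : glyphCodes) {
            // Codes missing from the table become U+FFFD (replacement character).
            sb.append(toUnicode.getOrDefault(code, '\uFFFD'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed glyph codes for the word "Noël" as stored in the page content.
        int[] content = {1, 2, 3, 4};

        // A table that agrees with the glyphs that are actually drawn.
        Map<Integer, Character> correct = new HashMap<>();
        correct.put(1, 'N');
        correct.put(2, 'o');
        correct.put(3, '\u00EB'); // ë
        correct.put(4, 'l');

        // Same table, except code 3 now claims to be the degree sign.
        // Rendering is unaffected because it never consults this table.
        Map<Integer, Character> wrong = new HashMap<>(correct);
        wrong.put(3, '\u00B0'); // °

        System.out.println(extract(content, correct));
        System.out.println(extract(content, wrong));
    }
}
```

In this model no extraction library could recover the ë from the second table, which is why copying the text in a viewer and pasting it elsewhere is a useful first check: if the pasted text is also wrong, the mapping stored in the PDF is the likely culprit.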

I'm showing an example of such a file in these movies:

2. iText is the problem

You are using a version that dates from November 2nd, 2012. In the (almost) three years that followed, we've fixed many bugs. Maybe your problem is already solved if you upgrade to iText 5.5.7.

If upgrading to iText 5.5.7 doesn't solve the problem and if the PDF is not the problem, you may have encountered a bug in iText. If you're using iText in a commercial context, you are a customer of iText Software; in that case, please contact support at iText through the closed ticketing system that is available for customers only.

Thx Bruno, upgrading to version 5.5.7 of itext resolved the issue — frederikdebacker, Oct 08 '15 at 06:15
