How to extract text with iTextSharp 4.1.6?

Question

iTextSharp 4.1.6 is the last version licensed under the LGPL and is free to use for commercial purposes without paying license fees.

I, and probably others, would like to know how to extract text with this version.

Does anyone have an idea?

See the following link for an example: http://stackoverflow.com/questions/2550796/reading-pdf-content-with-itextsharp-dll-in-vb-net-or-c-sharp — Hans, Apr 13 '12 at 16:30
@Hans, does that solution work with 4.1.6? ITextExtractionStrategy, SimpleTextExtractionStrategy and PdfTextExtractor are unknown to me. — Örjan Jämte, Sep 13 '12 at 12:43
I tried using the code at http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET . I found it only works for some PDFs; and it throws IndexOutOfRangeExceptions in CheckToken when it is called with single-character arguments (as that sample does). — Glenn Barnett, Oct 26 '12 at 13:19
@SpoiledTechie.com No, didn't try to fix it. I just used another solution. — der_chirurg, Aug 08 '13 at 09:15

score 11 · Answer 1 · answered Nov 24 '13 at 17:43

I had to hack this together manually as I was in the same boat as you. Hopefully this will help. It's probably not perfect, but I was able to get the text I needed out of the document this way. fileName is a string variable/parameter holding the path to the PDF file.

// Requires: using System.Text; using iTextSharp.text.pdf;
var reader = new PdfReader(fileName);

StringBuilder sb = new StringBuilder();

try
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
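        var cpage = reader.GetPageN(page);
        var content = cpage.Get(PdfName.CONTENTS);

        // Assumes /Contents is a single indirect reference; this cast throws
        // an InvalidCastException when /Contents is a PdfArray instead.
        var ir = (PRIndirectReference)content;

        var value = reader.GetPdfObject(ir.Number);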

        if (value.IsStream())
        {
            PRStream stream = (PRStream)value;

            var streamBytes = PdfReader.GetStreamBytes(stream);

            var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

            try
            {
                while (tokenizer.NextToken())
                {
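                    // Keep only string literals; operators, numbers and names
                    // (and with them spacing and layout) are ignored.
                    if (tokenizer.TokenType == PRTokeniser.TK_STRING)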
                    {
                        string str = tokenizer.StringValue;
                        sb.Append(str);
                    }
                }
            }
            finally
            {
                tokenizer.Close();
            }
        }
    }
}
finally
{
    reader.Close();
}

return sb.ToString();

This is one of the poor-man's text extraction solutions one sees so often. Actually, the text extraction capabilities in iText 2.1.7/4.2.0 were much more advanced than that (in spite of having quite a few deficits). Most likely they are also present in the latest iTextSharp before the license change. Give them a try! — mkl, Nov 24 '13 at 20:10
@mkl -- There is no PdfTextExtractor in iTextSharp at that version, at least not in the iTextSharp-LGPL NuGet package. This was the only way I could find to do it. If you know of a better way that is actually in the DLL, I'd appreciate it! — Paul, Nov 25 '13 at 15:09
Also I found a case where "content" is not a PRIndirectReference and instead is a PdfArray of PRIndirectReferences, so that case has to be handled accordingly as well. — Paul, Nov 25 '13 at 15:11
You are right; my assumption that the text extraction capabilities of the Java version had been ported to iTextSharp before the license change was wrong. Thus, I can think of no way short of porting the parser classes from Java iText 4.2.0 to C# yourself. I have no idea how easy or hard that is. Or, of course, you could switch to a current version of iTextSharp as soon as AGPL or commercial licensing becomes an option for you. — mkl, Nov 25 '13 at 15:36
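
As noted in the comments, /Contents is not always a single indirect reference; it can also be an array of references to several streams. Below is a minimal sketch of one way to handle both cases with the same tokenizer approach as the answer. The names ContentStreamSketch, GetPageContentBytes and ExtractStrings are made up for illustration. It assumes the static PdfReader.GetPdfObject(PdfObject) overload and the PdfArray.ArrayList property are available in your 4.1.6 build, so check them against your DLL. The limitations of the answer still apply: only string literals are collected, so spacing, positioning and font encodings are ignored.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;

// Hypothetical helper class; not part of iTextSharp.
public static class ContentStreamSketch
{
    // Returns the decoded bytes of all content streams of one page,
    // whether /Contents is a single stream or an array of streams.
    public static byte[] GetPageContentBytes(PdfReader reader, int page)
    {
        PdfDictionary pageDict = reader.GetPageN(page);

        // Assumption: this static overload resolves an indirect reference
        // and returns direct objects (or null) unchanged.
        PdfObject contents = PdfReader.GetPdfObject(pageDict.Get(PdfName.CONTENTS));

        MemoryStream buffer = new MemoryStream();
        if (contents == null)
        {
            return buffer.ToArray(); // page without a content stream
        }

        if (contents.IsStream())
        {
            Append(buffer, contents);
        }
        else if (contents.IsArray())
        {
            // Assumption: in 4.1.6 the array items are exposed as an ArrayList.
            foreach (PdfObject item in ((PdfArray)contents).ArrayList)
            {
                Append(buffer, PdfReader.GetPdfObject(item));
            }
        }
        return buffer.ToArray();
    }

    private static void Append(MemoryStream buffer, PdfObject obj)
    {
        PRStream stream = obj as PRStream;
        if (stream == null)
        {
            return; // skip anything that is not a stream read from the file
        }
        byte[] bytes = PdfReader.GetStreamBytes(stream);
        buffer.Write(bytes, 0, bytes.Length);
        buffer.WriteByte((byte)'\n'); // keep tokens of adjacent streams apart
    }

    // Collects the string literals of every page, as in the answer above.
    public static string ExtractStrings(PdfReader reader)
    {
        StringBuilder sb = new StringBuilder();
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            byte[] bytes = GetPageContentBytes(reader, page);
            PRTokeniser tokenizer = new PRTokeniser(new RandomAccessFileOrArray(bytes));
            try
            {
                while (tokenizer.NextToken())
                {
                    if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                    {
                        sb.Append(tokenizer.StringValue);
                    }
                }
            }
            finally
            {
                tokenizer.Close();
            }
        }
        return sb.ToString();
    }
}
```

The streams of a page are joined before tokenizing because a page's content may be split across the array entries at arbitrary points. Usage would look like `var reader = new PdfReader(fileName); try { text = ContentStreamSketch.ExtractStrings(reader); } finally { reader.Close(); }`. If your build exposes PdfReader.GetPageContent(int), it may already perform the same concatenation and could replace GetPageContentBytes.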
