iTextSharp 4.1.6 Cyrillic text extraction

Question

I am using iTextSharp 4.1.6-LGPL. The text extraction logic is the same as described in this answer.

    using System.Text;
    using iTextSharp.text.pdf;

    var path = @"D:\ru.pdf";
    var reader = new PdfReader(path);
    StringBuilder sb = new StringBuilder();

    try
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            var cpage = reader.GetPageN(page);
            var content = cpage.Get(PdfName.CONTENTS);

            // Assumes /Contents is a single indirect reference (it can also be an array).
            var ir = (PRIndirectReference)content;
            var value = reader.GetPdfObject(ir.Number);

            if (value.IsStream())
            {
                PRStream stream = (PRStream)value;
                var streamBytes = PdfReader.GetStreamBytes(stream);
                var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

                try
                {
                    // Collect every string token from the content stream.
                    // No font encoding is applied at this level.
                    while (tokenizer.NextToken())
                    {
                        if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                        {
                            string str = tokenizer.StringValue;
                            sb.Append(str);
                        }
                    }
                }
                finally
                {
                    tokenizer.Close();
                }
            }
        }
    }
    finally
    {
        reader.Close();
    }

    var res = sb.ToString();

The input PDF file contains only one word: Слово

The actual extraction result is: ru-RU\u0002Á\u0003#\u0003(\u0003\u000f\u0003(
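
To make that output easier to inspect, here is a minimal sketch that pairs the extracted characters into 16-bit codes. It assumes each character holds one raw byte and that the leading `ru-RU` comes from an unrelated string token, so it is left out of the input. The class and method names (`GlyphCodeDump`, `ToTwoByteCodes`) are hypothetical and not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;

public static class GlyphCodeDump
{
    // Pairs consecutive characters (each assumed to carry one raw byte)
    // into big-endian 16-bit codes. A trailing unpaired character is ignored.
    public static List<int> ToTwoByteCodes(string raw)
    {
        var codes = new List<int>();
        for (int i = 0; i + 1 < raw.Length; i += 2)
        {
            codes.Add(((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF));
        }
        return codes;
    }

    public static void Main()
    {
        // The extracted string from above, without the "ru-RU" prefix.
        string raw = "\u0002\u00C1\u0003#\u0003(\u0003\u000f\u0003(";
        foreach (int code in ToTwoByteCodes(raw))
        {
            Console.Write("{0:X4} ", code);
        }
        // prints: 02C1 0323 0328 030F 0328
    }
}
```

This yields five codes for the five letters of Слово, and the third and fifth codes match, just as both are "о". That suggests the string holds two-byte font-specific codes rather than Unicode text, which would explain why re-encoding the string alone does not help, though this is an interpretation of the output and not something confirmed against the PDF's font dictionary.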

I tried various Encoding conversions, with no success.

Also, the newest version of iTextSharp produces the correct output using PdfTextExtractor, which is not available in 4.1.6.

Does anyone know how to get the correct output?

I'm afraid that you have already found the technically correct solution, which you don't want to use for a non-technical reason. Is it the cost of the license? Another reason? Are you a lone developer, or is there someone at your company you need to talk to for approval? Is there any way that iText can help you? - Amedee Van Gasse, Apr 01 '20 at 17:02
@AmedeeVanGasse, I am a lone developer in this particular case. At the moment, I am interested in minimal investment since my project is at the initial stage. Of course, I am aware of the benefits that the iText license provides and will definitely consider purchasing it in the future, as soon as I am sure that it is really necessary and will help me improve my product. - Alex Wyler, Apr 07 '20 at 21:37
Okay. Maybe good to know: (1) you can get a 30-day trial license; (2) if you do not distribute your software to other people, so as long as you still have it only on your own PC and nobody else is interacting with it (either directly or over a network), then the license doesn't play a role. It's only when you let someone else interact with your software that the license comes into play. DISCLAIMER: I am not a lawyer, this is not legal advice. ;-) - Amedee Van Gasse, Apr 08 '20 at 09:10
