C#: How to read/convert/extract Hebrew PDF content to text using iTextSharp

Question

I am trying to extract Hebrew text from a PDF using iTextSharp.

This is my code:

    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

This is the result I get (shown in a linked screenshot).

The English text came out fine, but not the Hebrew part.

How can I extract the Hebrew text?

Thanks in advance

Is that screenshot from the Console? If so, do not trust what you see: by default the Console will not display Unicode correctly (see http://stackoverflow.com/questions/5750203/how-to-write-unicode-characters-to-the-console et al). Write the text to a file and open it with an application that you know supports Unicode and uses a font containing the Hebrew characters. — Paul-Jan, Jun 18 '16 at 12:32
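The suggestion above, writing the text to a file instead of trusting the Console, could be sketched roughly as follows. This is only an illustration: the class name `Utf8FileDemo`, the helper `SaveAsUtf8`, the sample string, and the temp-file name are all hypothetical and not part of the original post.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8FileDemo
{
    // Hypothetical helper: writes text as UTF-8 with a byte order mark,
    // which helps many editors detect the encoding.
    public static void SaveAsUtf8(string path, string text)
    {
        File.WriteAllText(path, text, new UTF8Encoding(true));
    }

    public static void Main()
    {
        // Stand-in for extracted PDF text: English plus four Hebrew letters.
        string sample = "Hello \u05E9\u05DC\u05D5\u05DD";
        string path = Path.Combine(Path.GetTempPath(), "extracted.txt");

        SaveAsUtf8(path, sample);

        // Read the file back to check that nothing was lost on disk.
        string roundTripped = File.ReadAllText(path, Encoding.UTF8);
        Console.WriteLine(roundTripped == sample
            ? "Round trip preserved the text."
            : "Text changed on disk.");
    }
}
```

Opening the resulting file in an editor with a Hebrew-capable font shows what was actually extracted, independent of how the Console renders it.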
@Paul-Jan I copied the text to a .txt file and you were right, it was different than in the console. I got all the Hebrew text, but now I have a new problem: the text is backwards, since Hebrew is written right to left while English is written left to right. — barak, Jun 18 '16 at 15:02
Hebrew will sort of work. The problem is that no reordering is done, so the text comes out in presentation order and not in logical order, as would be expected for an RTL language. — Paulo Soares, Jun 18 '16 at 15:02
There's a bidi class in iTextSharp to do the reordering that may work here; the algorithm is mostly reversible. — Paulo Soares, Jun 18 '16 at 15:11
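To illustrate what "reordering" means here, below is a deliberately naive sketch that reverses each run of Hebrew characters in a line while leaving everything else in place. It does not use the iTextSharp bidi class mentioned above and it is not the Unicode bidirectional algorithm; the names `NaiveRtlReorder`, `IsHebrew`, and `ReverseHebrewRuns` are hypothetical. It assumes a left-to-right line with embedded Hebrew and will misplace vowel points, punctuation, and digits inside Hebrew runs.

```csharp
using System;
using System.Text;

public static class NaiveRtlReorder
{
    // True for characters in the Unicode Hebrew block (U+0590..U+05FF).
    public static bool IsHebrew(char c)
    {
        return c >= '\u0590' && c <= '\u05FF';
    }

    // Reverses each maximal run of Hebrew characters. A single space is kept
    // inside a run only when it is directly followed by another Hebrew character.
    public static string ReverseHebrewRuns(string line)
    {
        var result = new StringBuilder(line.Length);
        int i = 0;
        while (i < line.Length)
        {
            if (!IsHebrew(line[i]))
            {
                result.Append(line[i]);
                i++;
                continue;
            }

            // Find the end (exclusive) of the Hebrew run starting at i.
            int j = i;
            while (j < line.Length &&
                   (IsHebrew(line[j]) ||
                    (line[j] == ' ' && j + 1 < line.Length && IsHebrew(line[j + 1]))))
            {
                j++;
            }

            // Append the run back to front.
            for (int k = j - 1; k >= i; k--)
            {
                result.Append(line[k]);
            }
            i = j;
        }
        return result.ToString();
    }

    public static void Main()
    {
        // Gimel, Bet, Alef in presentation order between two English words.
        string visual = "abc \u05D2\u05D1\u05D0 def";
        string logical = ReverseHebrewRuns(visual);
        Console.WriteLine(logical == "abc \u05D0\u05D1\u05D2 def"
            ? "Hebrew run was reversed."
            : "Unexpected result.");
    }
}
```

For real documents, a proper bidi implementation is the safer choice; this sketch only shows why presentation order and logical order differ.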
Although not directly related to your question, your extraction code is actually incorrect and will probably break eventually. Basically, remove the entire line that tries to "fix" the encoding, because it doesn't do what you think it does. See [this answer](http://stackoverflow.com/a/10191879/231316) for a more in-depth explanation. — Chris Haas, Jun 19 '16 at 16:08
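A small sketch of why that conversion line is risky: a .NET `string` is already Unicode, so pushing it through a single-byte encoding can only lose characters. In the question the source encoding is `Encoding.Default`, whose value depends on the machine and runtime; here ISO-8859-1 is used as a stand-in, which is an assumption made purely for illustration. The names `EncodingRoundTripDemo` and `LossyRoundTrip` are hypothetical.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Mirrors the shape of the "fix" line in the question, but with an
    // explicit single-byte encoding instead of Encoding.Default.
    public static string LossyRoundTrip(string text)
    {
        Encoding singleByte = Encoding.GetEncoding("ISO-8859-1");

        // Hebrew letters have no representation here, so they are replaced.
        byte[] legacyBytes = singleByte.GetBytes(text);

        // Converting the already damaged bytes to UTF-8 cannot restore them.
        byte[] utf8Bytes = Encoding.Convert(singleByte, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string original = "abc \u05E9\u05DC\u05D5\u05DD";
        string converted = LossyRoundTrip(original);

        Console.WriteLine(converted == original
            ? "Text survived the round trip."
            : "Text was altered by the round trip.");
    }
}
```

With the conversion line removed, the string returned by `PdfTextExtractor.GetTextFromPage` can be appended to the `StringBuilder` as it is.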
