I use this code to read PDF content with iTextSharp. It works fine when the content is English, but it doesn't work when the content is Persian or Arabic.
Here is a sample non-English PDF for testing.
The result is something like this:

َٛنا Ùٔب٘طث یؿیٛ٘ زؾا ÙÙ›ÙØ­Ù” قٛمح یٔبٕس © Karl Seguin foppersian.codeplex.com www.codebetter.com 1 1 Ùٔب٘طث َٛنا یؿیٛ٘

همانرب لوصا یسیون  مرن دیلوت رتهب Ø±Ø§Ø²ÙØ§

What is the solution?

    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
                text.Append(currentText);
                pdfReader.Close();
            }
        }
        return text.ToString();
    }
Shahin
  • I think for Arabic characters it is printing their corresponding Unicode characters. So before printing you need to convert them into a normal string/character. – vikiiii Apr 17 '12 at 06:01
  • @vikiiii Thanks. Do you have any idea how I should do this? – Shahin Apr 17 '12 at 06:20
  • [See this answer](http://stackoverflow.com/questions/9447648/parse-a-persian-pdf-file-to-txt-and-its-images/9454073#9454073) for an example. But even then, there **was** a problem (IIRC with 5.1.2) because Persian/Arabic are right-to-left languages. Suggest you try the latest release or SVN and see if the problem has been fixed. – kuujinbo Apr 17 '12 at 09:49
  • Check this question, which may help you: http://stackoverflow.com/questions/16080741/convert-arabicunicode-content-html-or-xml-to-pdf-using-itextsharp – Mohamed Salah Jul 27 '15 at 14:45

1 Answer

In .NET, once you have a string, you have a string, and it is always Unicode. The actual in-memory implementation is UTF-16, but that doesn't matter here. Never, ever decompose a string into bytes, reinterpret those bytes as a different encoding, and turn the result back into a string: it doesn't make sense and will almost always fail.

Your problem is this line:

    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));

I'm going to pull it apart into a couple of lines to illustrate:

    byte[] bytes = Encoding.UTF8.GetBytes("ی"); // bytes now holds 0xDB 0x8C
    byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes); // with a Windows-1252 default, converted now holds 0xC3 0x9B 0xC5 0x92
    string final = Encoding.UTF8.GetString(converted); // final now holds "ÛŒ" instead of "ی"

This round trip mangles every character outside the 7-bit ASCII range (code points above 127). Drop the re-encoding line and you should be good.

Side note: it is entirely possible that whatever creates a string does so incorrectly; that's actually not uncommon. But you need to fix that problem at the byte level, before the data becomes a string.
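
As a minimal sketch of what fixing it "at the byte level" can look like: the two bytes below are the UTF-8 encoding of "ی", and the only thing that changes is which encoding is used to decode them. The `DecodeDemo` class and its method names are made up for illustration, and ISO-8859-1 simply stands in for "the wrong encoding".

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // UTF-8 bytes for U+06CC (ARABIC LETTER FARSI YEH, "ی").
    public static readonly byte[] Raw = { 0xDB, 0x8C };

    // Correct: decode the bytes once, with the encoding they were written in.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Wrong: a mismatched single-byte encoding maps each byte to its own
    // character, so the resulting string is already damaged.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }
}
```

`DecodeDemo.DecodeAsUtf8(DecodeDemo.Raw)` gives back the single character "ی", while `DecodeAsLatin1` gives a two-character string. The point is that the choice has to be made while the data is still a `byte[]`.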

EDIT

The code should be exactly the same as yours above, except that the re-encoding line is removed (I also moved `pdfReader.Close()` out of the loop so the reader isn't closed after the first page). Also, make sure that whatever you use to display the text supports Unicode. And, as @kuujinbo said, make sure that you're using a recent version of iTextSharp; I tested this with 5.2.0.0.

    public string ReadPdfFile(string fileName) {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName)) {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }
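
On the display side, one low-risk way to check the extracted text is to write it to a UTF-8 file and open it in a Unicode-aware editor, rather than trusting a console window. A rough sketch, where `OutputDemo`, `SaveAsUtf8` and `LoadUtf8` are hypothetical names and the string passed in stands in for whatever `ReadPdfFile` returns:

```csharp
using System.IO;
using System.Text;

public static class OutputDemo
{
    // Writes the text to disk encoded as UTF-8.
    public static void SaveAsUtf8(string path, string text)
    {
        File.WriteAllText(path, text, Encoding.UTF8);
    }

    // Reads the file back as UTF-8; the round trip should be lossless.
    public static string LoadUtf8(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

For a console application, setting `Console.OutputEncoding = Encoding.UTF8;` is the analogous step, although the console font also needs to contain Arabic glyphs.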

EDIT 2

The above code fixes the encoding issue but doesn't fix the order of the characters within the strings. Unfortunately, this problem appears to originate at the PDF level itself. As the PDF specification puts it:

> Consequently, showing text in such right-to-left writing systems requires either positioning each glyph individually (which is tedious and costly) or representing text with show strings (see 9.2, “Organization and Use of Fonts”) whose character codes are given in reverse order.
>
> PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings

When re-ordering strings as described above, the content is (if I understand the spec correctly) supposed to be wrapped in a "marked content" section (BMC). However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I could absolutely be wrong on this part because this is very much not my specialty, so you'll have to poke around some more.
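
If the extracted chunks really do come back in visual (reversed) order, one workaround reported in the comments is to reverse each chunk after extraction. The sketch below does that and, as an extra step that is my own assumption, normalizes Arabic presentation forms (such as the "ﻼ" ligature seen in the comments) back to plain letters. `RtlFixup` and `ToLogicalOrder` are hypothetical names, and this is not a full bidirectional-text algorithm.

```csharp
using System;
using System.Text;

public static class RtlFixup
{
    // Assumption: 'chunk' holds one run of right-to-left text in visual
    // (reversed) order, as returned by a text extraction strategy.
    public static string ToLogicalOrder(string chunk)
    {
        char[] chars = chunk.ToCharArray();
        // Naive reversal: breaks surrogate pairs and misplaces combining marks.
        Array.Reverse(chars);

        // Reverse first, then normalize: a ligature such as U+FEFC is a single
        // code point, and FormKC expands it to lam + alef in logical order.
        return new string(chars).Normalize(NormalizationForm.FormKC);
    }
}
```

Under those assumptions, `RtlFixup.ToLogicalOrder("\u0645\uFEFC\uFEB3")` yields the four letters of "سلام". Chunks that mix in Latin text or digits would be reversed as well, so a real fix needs to detect which runs are right-to-left first.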

Chris Haas
  • Thank you, sir. I tried to fix my function according to your answer but I wasn't successful. Would you please post the complete function in your answer? – Shahin Apr 17 '12 at 13:38
  • Your solution works with normal text, but it doesn't work when the data comes from a PDF. For a PDF with the content "سلام" it returns "م ﻼ ﺳ ". – Shahin Apr 17 '12 at 13:44
  • shaahin, my code will fix your first problem, which was just an encoding issue. Your second problem is LTR vs RTL and, as kuujinbo said, that will probably need to be fixed at the iText/iTextSharp level. – Chris Haas Apr 17 '12 at 14:13
  • @ChrisHaas I solved the LTR vs RTL problem. First I used the code you provided in `http://stackoverflow.com/questions/6882098/how-can-i-get-text-formatting-with-itextsharp`, and then I modified it a little: I replaced `this.result.Append(renderInfo.GetText());` with `var text = renderInfo.GetText(); text = String.Join(string.Empty, text.Reverse()); this.result.Append(text);` and everything was perfect. ;) – Zain Shaikh Sep 02 '12 at 08:57
  • @ChrisHaas I am doing the same thing: http://stackoverflow.com/questions/15385270/read-pdf-using-itextsharp-where-pdf-language-is-non-english Any suggestion would be helpful. – Rahul Rajput Mar 13 '13 at 12:32