This code returns a lot of \0\0 sequences and extracts only a few English phrases from the PDF. None of the Japanese text is returned.

I am using Unicode encoding, so I am not sure what is happening here.

StringBuilder text = new StringBuilder(2000);
string fullFileName = @"c:\my_japanaese_pdf.pdf";
PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(fullFileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    currentText = Encoding.Unicode.GetString(UnicodeEncoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(currentText)));
    text.Append(currentText);
}
pdfReader.Close();

(Windows 7 x64, iTextSharp 5.0.2.0)

Thanks

Ryan

  • Interesting that Console.WriteLine("節電対策") outputs ?????? in the console. Yet I have Japanese as a display language option in Control Panel, so assume I have native JP support on my PC. Riiiight? – Ryan Jul 02 '14 at 11:27
  • That's one problem. The other problem is the iText(Sharp) version. It's way too old. Please read http://stackoverflow.com/questions/24326767/difference-between-itextsharp-4-1-6-and-5-x-versions/ to find out what has changed since version 5.0.2 in the context of text extraction. I've also made a video that explains why some PDF don't allow you to extract text: https://www.youtube.com/watch?v=wxGEEv7ibHE Without seeing the PDF with the Japanese text, this question is unanswerable. – Bruno Lowagie Jul 02 '14 at 11:31
  • Are you sure `Encoding.Unicode.GetString(UnicodeEncoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(currentText)))` does not break more than it fixes? – mkl Jul 02 '14 at 11:57
  • Thanks @BrunoLowagie - not sure how I have an outdated copy, I'll redownload and also watch your video. So far this looks like a great library, thanks for providing it! If all works out (I can solve this JP extract issue) we'll look to incorporate to our project and license accordingly. – Ryan Jul 02 '14 at 12:48
  • @Ryan, see this for what mkl is talking about. [Once you have a string, **you have a string**, and it is Unicode, **always**](http://stackoverflow.com/a/10191879/231316) – Chris Haas Jul 02 '14 at 13:52
  • Confession: this was a copy/paste from code elsewhere. I should've looked at that line more carefully. Thanks a lot for pointing it out. – Ryan Jul 02 '14 at 14:02
  • possible duplicate of [Reading pdf content using iTextSharp in C#](http://stackoverflow.com/questions/10185643/reading-pdf-content-using-itextsharp-in-c-sharp) – Chris Haas Jul 02 '14 at 14:10
  • Good to hear, I'm going to flag this as a duplicate – Chris Haas Jul 02 '14 at 14:11
  • Hmm, even after the upgrade and removal of the encoding, it still doesn't work. All Japanese text is simply removed. I can't see the YouTube video from work (!!!) will watch from home. As @BrunoLowagie points out, maybe it's 'just not possible'. – Ryan Jul 05 '14 at 13:50
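
Two points from the comments above can be checked without any PDF at all. The following is a minimal sketch, not part of the original question: the class name `StringEncodingDemo` and both method names are hypothetical. `RoundTrip` repeats the conversion from the question on an ordinary string, and `PrintUtf8` assumes that switching the console to UTF-8 is what the `??????` output needs (the console font must also contain Japanese glyphs, which this code cannot guarantee).

```csharp
using System;
using System.Text;

public static class StringEncodingDemo
{
    // Repeats the round trip from the question:
    // string -> UTF-16 bytes -> "converted" UTF-16 bytes -> string.
    // For a well-formed string this returns the same text it was given.
    public static string RoundTrip(string s)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(s);
        byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.Unicode, bytes);
        return Encoding.Unicode.GetString(converted);
    }

    // Switches console output to UTF-8 before writing.
    // Assumption: the console font in use can actually display the characters.
    public static void PrintUtf8(string s)
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(s);
    }
}
```

Calling `StringEncodingDemo.RoundTrip("節電対策")` gives back the same four characters, which is why that line cannot repair text that was already extracted incorrectly.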

1 Answer


I had this same problem, and here's what I did (note this code is extremely similar to the code in the question, but doesn't use any encoding conversion stuff).

using iTextSharp.text.pdf.parser;

using (iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(inputPDF))
{
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // Create a fresh strategy for every page; a reused strategy
        // keeps the text it collected from earlier pages.
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

        string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
        string[] lines = page.Split('\n');
        foreach (string line in lines)
        {
            // do anything you want here
        }
    }
}

Even with the above code, I still got no Japanese characters out of the PDF, so I changed the font used in the PDF to Meiryo UI. That solved the problem for me: iTextSharp (at least version 5.5.13.2) appears to handle Meiryo UI correctly, so Japanese text set in that font can be extracted from the PDF.
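
To tell quickly whether a given PDF or font change actually yields Japanese text, the extracted string can be scanned for Japanese code points. This is only a rough sketch: `JapaneseTextCheck` and `ContainsJapanese` are hypothetical names, not part of iTextSharp, and the check covers only Hiragana, Katakana, and the basic CJK Unified Ideographs block (which is shared with Chinese).

```csharp
using System;

public static class JapaneseTextCheck
{
    // Returns true if the text contains at least one Hiragana, Katakana,
    // or basic CJK Unified Ideograph character.
    public static bool ContainsJapanese(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return false;
        }

        foreach (char c in text)
        {
            bool hiragana = c >= '\u3040' && c <= '\u309F';
            bool katakana = c >= '\u30A0' && c <= '\u30FF';
            bool ideograph = c >= '\u4E00' && c <= '\u9FFF';
            if (hiragana || katakana || ideograph)
            {
                return true;
            }
        }

        return false;
    }
}
```

Passing each page's extracted text to `JapaneseTextCheck.ContainsJapanese` shows which pages lose their Japanese content during extraction.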

todbott