How can I extract Japanese text from a PDF using iTextSharp?

Question

This code returns lots of \0\0s and extracts only a few English phrases from the PDF. None of the Japanese text is returned.

I am using Unicode encoding, so I am not sure what is happening here.

StringBuilder text = new StringBuilder(2000);
string fullFileName = @"c:\my_japanaese_pdf.pdf";
PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(fullFileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    currentText = Encoding.Unicode.GetString(UnicodeEncoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(currentText)));
    text.Append(currentText);
}
pdfReader.Close();

(Windows 7 x64, iTextSharp 5.0.2.0)

Thanks

Ryan

Interesting that Console.WriteLine("節電対策") outputs ?????? in the console. Yet I have Japanese as a display language option in Control Panel, so assume I have native JP support on my PC. Riiiight? — Ryan, Jul 02 '14 at 11:27
That's one problem. The other problem is the iText(Sharp) version. It's way too old. Please read http://stackoverflow.com/questions/24326767/difference-between-itextsharp-4-1-6-and-5-x-versions/ to find out what has changed since version 5.0.2 in the context of text extraction. I've also made a video that explains why some PDFs don't allow you to extract text: https://www.youtube.com/watch?v=wxGEEv7ibHE. Without seeing the PDF with the Japanese text, this question is unanswerable. — Bruno Lowagie, Jul 02 '14 at 11:31
Are you sure `Encoding.Unicode.GetString(UnicodeEncoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(currentText)))` does not break more than it fixes? — mkl, Jul 02 '14 at 11:57
Thanks @BrunoLowagie - not sure how I have an outdated copy, I'll redownload and also watch your video. So far this looks like a great library, thanks for providing it! If all works out (I can solve this JP extract issue) we'll look to incorporate to our project and license accordingly. — Ryan, Jul 02 '14 at 12:48
@Ryan, see this for what mkl is talking about. [Once you have a string, **you have a string**, and it is Unicode, **always**](http://stackoverflow.com/a/10191879/231316) — Chris Haas, Jul 02 '14 at 13:52
Confession: this was a copy/paste from code elsewhere. I should've looked at that line more carefully. Thanks a lot for pointing it out. — Ryan, Jul 02 '14 at 14:02
possible duplicate of [Reading pdf content using iTextSharp in C#](http://stackoverflow.com/questions/10185643/reading-pdf-content-using-itextsharp-in-c-sharp) — Chris Haas, Jul 02 '14 at 14:10
Hmm, even after the upgrade and removal of the encoding, it still doesn't work. All Japanese text is simply removed. I can't see the YouTube video from work (!!!) will watch from home. As @BrunoLowagie points out, maybe it's 'just not possible'. — Ryan, Jul 05 '14 at 13:50

score 1 · Answer 1 · answered Dec 22 '21 at 05:14

I had the same problem, and here is what I did. Note that this code is very similar to the code in the question, but it does not do any encoding conversion.

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// inputPDF is the path to the PDF file.
using (PdfReader reader = new PdfReader(inputPDF))
{
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        // Create a fresh strategy per page; a reused instance keeps the text of earlier pages.
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
        string[] lines = page.Split('\n');
        foreach (string line in lines)
        {
            // do anything you want here
        }
    }
}

Even with the above code, I was still not getting any Japanese characters out of the PDF, so I changed the font used in the PDF to Meiryo UI. That solved the problem for me: iTextSharp (at least version 5.5.13.2) recognizes Meiryo UI, so Japanese text set in that font can be extracted successfully.
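
For anyone debugging a similar case, here is a minimal sketch of two checks that use only the .NET base class library. The `ExtractionChecks` class and its method names are hypothetical helpers made up for illustration; they are not part of iTextSharp. `RoundTrip` mirrors the conversion line from the question to show that, for a well-formed string, it hands back the same string. `ContainsJapanese` tests whether a string holds any hiragana, katakana, or common kanji, which makes it easy to see whether an extracted page contains Japanese at all.

```csharp
using System;
using System.Text;

// Hypothetical helper class for illustration; not part of iTextSharp.
public static class ExtractionChecks
{
    // Mirrors the conversion line from the question:
    // UTF-16 bytes are converted to UTF-16 and decoded again.
    public static string RoundTrip(string s)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(s);
        byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.Unicode, bytes);
        return Encoding.Unicode.GetString(converted);
    }

    // True if the text contains at least one hiragana, katakana,
    // or CJK Unified Ideographs (common kanji) character.
    public static bool ContainsJapanese(string s)
    {
        foreach (char c in s)
        {
            bool kana = c >= '\u3040' && c <= '\u30FF';   // hiragana and katakana blocks
            bool kanji = c >= '\u4E00' && c <= '\u9FFF';  // CJK Unified Ideographs
            if (kana || kanji)
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        // The four kanji from the comment thread, written as escapes
        // so the source file encoding does not matter.
        string sample = "\u7BC0\u96FB\u5BFE\u7B56";

        Console.WriteLine(RoundTrip(sample) == sample);    // True
        Console.WriteLine(ContainsJapanese(sample));        // True
        Console.WriteLine(ContainsJapanese("abc\0\0"));     // False
    }
}
```

If `ContainsJapanese` returns false for the text extracted from a page that visibly contains Japanese, the characters were most likely lost during extraction itself and not by any later encoding step or by the console display.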
