This code returns a lot of \0\0 sequences and extracts only a few English phrases from the PDF. None of the Japanese text is returned.
I am converting the extracted text with Unicode encoding, so I am not sure what is going wrong here.
StringBuilder text = new StringBuilder(2000);
string fullFileName = @"c:\my_japanaese_pdf.pdf";
PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(fullFileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    currentText = Encoding.Unicode.GetString(UnicodeEncoding.Convert(Encoding.Unicode, Encoding.Unicode, Encoding.Unicode.GetBytes(currentText)));
    text.Append(currentText);
}
pdfReader.Close();
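To narrow this down, here is a minimal sketch that isolates the conversion step from the loop and runs it without iTextSharp. The class name `ConversionCheck` and the sample string are hypothetical and not taken from the actual PDF. Since the source and target encodings are both `Encoding.Unicode`, the round trip should leave a well-formed string unchanged and the program should print `True`, which would suggest the \0 characters are already present in what `GetTextFromPage` returns rather than being introduced by the conversion.

```csharp
using System;
using System.Text;

class ConversionCheck
{
    static void Main()
    {
        // Hypothetical sample: three Japanese characters followed by ASCII text.
        // Written with \u escapes so the source file encoding does not matter.
        string sample = "\u65E5\u672C\u8A9E text";

        // Same round trip as in the loop:
        // string -> UTF-16 bytes -> "converted" UTF-16 bytes -> string.
        byte[] utf16 = Encoding.Unicode.GetBytes(sample);
        byte[] converted = Encoding.Convert(Encoding.Unicode, Encoding.Unicode, utf16);
        string roundTripped = Encoding.Unicode.GetString(converted);

        // Compares the result with the input; equal strings mean the
        // conversion neither repaired nor damaged the text.
        Console.WriteLine(roundTripped == sample);
    }
}
```

If that holds, the conversion line in the loop is effectively a no-op, and the place to look is the extraction call or the library version.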
(Windows 7 x64, iTextSharp 5.0.2.0)
Thanks
Ryan