I'm using iTextSharp to read a PDF file. I try to read the full text in the first page with this simple code:
var pdfReader = new PdfReader("<fileName>");
var pageText = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new SimpleTextExtractionStrategy());
It returns a string like this:
"\0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 !\n\" \0 \0 \0 \0 \0 \0 # \0 $ \0 % \0 & $ \0 ’ \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 !\n\" \0 \0 \0 (\n\0 \0 \0 ) \0 \0 * \0 + , \0 , \0 \0 & , \0 - \0 . # \0 \0 \0 & $ \0 , \0 /\n+ \0 & & \0 * 0 \0 1 .\n2 \0 3\n4 - \0 5 \0 \0 $ \0 \0 # \0 \0 \0 & $ \0 , \0 * & \0 \0 ’ \0 .\n6\n\0 \0 \0 - \0 \0 \0 \0 & \0 \0 \0 \0 \0 \0 \0 , \0 # \0 \0 \0 & $ \0 , \0 \0 \0 & \0 # \0 \0 & $ ’ ) & \0 \0 \0 \0 # \0 ’ ’ \0 7 - \0 $ \0 \0 7 \0 ’ \0 , \0 8\n9 5 \0 \0 , \0 \0 $ $ \0 \0 \0 \0 \0 ’ \0 \0 3\n\0 \0 \0 ) \0 \0 \0 \0 4 - \0 5 \0 \0 $ \0 \0 * & \0 \0 ’ \0 .\n\0 \0 \0 \0 # \0 $ \0 $ \0 \0 ) \0 \0 \0 : 0 ; \0 ; < ; : 1 ; + \0 = < 9 = < < > \0 ? \0 ? \0 3 \0 (\n@\n\0 \0 # \0 $ \0 % \0 & $ \0 ’ \0 ! 3\n\0 ......"
I can read the original PDF with Acrobat Reader and browsers. The file seems to be a PDF/A.
The code I use works with other PDF.
Does iText have problem with this standard?
Can someone point me to the right direction?
Update
Copy/paste from Acrobat gives me broken text. I don't think it's an iTextSharp (5.5.10) problem.
Update
You can try with this file: PDF Example