I can successfully read PDF files in my .NET code using iTextSharp's PdfReader. However, it sometimes misreads characters, and as far as I can tell there is no fixed set of characters that it gets wrong. I suspect there is some limit, related to the resolution of the PDF, on what it can read with 100% accuracy. Has anyone observed what the limitations of iTextSharp's PdfReader are?
-
Is this about text reading? Please edit your question so that we can understand it; it doesn't make sense as it is. – Paulo Soares Jan 11 '17 at 11:59
-
Please see http://stackoverflow.com/help/how-to-ask how to ask a good question. – Flummox - don't be evil SE Jan 11 '17 at 12:39
-
iText doesn't read "pixel text". iText reads vector data and looks at the to-Unicode table of the font inside the PDF. If that table points at the wrong characters for some glyphs, you have a PDF with a font that doesn't allow correct text extraction. In some cases that's a deliberate procedure to obfuscate content so that search engines can't spider a document with secret information correctly. – Bruno Lowagie Jan 11 '17 at 13:04
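To illustrate the point about the to-Unicode table, here is a minimal toy sketch in Python. It is not iText's implementation: the `extract_text` helper, the glyph codes, and both tables are hypothetical, made up only to show why the same glyphs can look right on screen yet extract as wrong characters.

```python
# Toy model (hypothetical, not iText's code): text extraction looks up each
# glyph code in the font's to-Unicode table instead of "reading pixels".

def extract_text(glyph_codes, to_unicode):
    """Map each glyph code to a character; unmapped codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)

# Glyph codes as they might appear in a page content stream (made-up values).
glyph_codes = [1, 2, 3, 3, 4]

# A table that matches what the glyphs actually draw.
correct_table = {1: "H", 2: "e", 3: "l", 4: "o"}

# The glyphs drawn on the page are unchanged, but two entries point at the
# wrong characters, so extraction returns wrong text.
broken_table = {1: "H", 2: "c", 3: "1", 4: "o"}

good = extract_text(glyph_codes, correct_table)
bad = extract_text(glyph_codes, broken_table)

print(good)  # Hello
print(bad)   # Hc11o
```

In this model the errors follow the table and not the rendering resolution, so a higher dpi would not change the extracted text.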
-
Probably similar to [this post](http://stackoverflow.com/questions/41497882). I answered there and gave a way to test such files using Copy-Paste. I also suggested possible solutions using either free tools like ImageMagick and Tessnet or [the LEADTOOLS professional SDK](https://www.leadtools.com/sdk/recognition-imaging). – Amin Dodin Jan 12 '17 at 22:39
-
@Amin - it's not about the readability of the PDF. I am able to read the PDFs, but across 8 pages the converted text may contain erroneous words, misprinted words, or partial text in about 5-6 places. So my concern is the accuracy of iTextSharp's reading: if it misreads some text, what condition of the PDF text leads to that, or what is the maximum/minimum level of some PDF quality parameter that will give 100% accuracy in this type of conversion? For example, could I say that the PDF should be at least 100 dpi (or something else) to be read best by this method? Have I made my concern clear? – Pratik Jan 13 '17 at 04:07
-
Did you try to copy the text using Adobe Reader then paste it into MS Word? This should tell you if the text is stored correctly or not. Can you send me a sample PDF file using email to support@leadtools.com? – Amin Dodin Jan 15 '17 at 11:10