
I am trying to read a PDF file using iTextSharp. When the PDF contains text in a language other than English (Hindi or Arabic, for example), the extracted words are not correct.

I am wondering: should I install the Hindi or Arabic font on my system, or do I need to do something with the encoding?

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);

Edit:

Sample PDF as Image:

[image of the sample PDF page]

Extracted Text:

uxj ikfydk ifj"kn fuokZpd ukekoyh& 2011 i`"B la[;k % 1 1 1 1& & & & ftys dk uke ftys dk uke ftys dk uke ftys dk uke % % % % 0701-ò¶âã£ûæ– 2 2 2 2& & & & fudk fudk fudk fudk; ; ; ; dk uke dk uke dk uke dk uke % % % % 1-¢âî™ 3 3 3 3& & & & okMZ la okMZ la okMZ la okMZ la[ [ [ [; ; ; ;k o uke k o uke k o uke k o uke % % % % 1-¯â“¯â™®â£û¶âû §âîºâã®â£û¶âû Õô¯âû®â£û¶âû 4 4 4 4& & & & Hkkx la Hkkx la Hkkx la Hkkx la[ [ [ [; ; ; ;k k k k % % % %

Parwej
  • See if this helps: http://stackoverflow.com/a/10191879/231316 – Chris Haas Jun 05 '12 at 17:24
  • Sorry Chris, no help. I am trying to read Hindi PDF file. – Parwej Jun 05 '12 at 17:41
  • Can you post a sample PDF? If not, can you at least post the raw bytes extracted, maybe the first 20 or so? Fonts should not matter in any way for text extraction, fonts are only used for rendering. – Chris Haas Jun 05 '12 at 19:15
  • Hi Chris, Just edited the post with sample pdf as image attached and extracted text for some of the part – Parwej Jun 07 '12 at 15:42
  • Hi Chris, please comment on my response. I did not find any solution for my issue. – Parwej Jun 09 '12 at 14:57

1 Answer

Do not apply any encoding conversion, because you do not know which encoding the PDF file uses. I think the following will work:

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text = text + currentText;

// Do what you want with the text
MessageBox.Show(text);
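To illustrate why the conversion chain from the question cannot help, here is a minimal sketch. `GetTextFromPage` already returns a .NET `string` (UTF-16), so converting it to bytes and back can at best return the same text and at worst destroy characters. The helper name `RoundTrip` is hypothetical, and `Encoding.ASCII` is used only as a stand-in for `Encoding.Default`, whose actual value depends on the system.

```csharp
using System;
using System.Text;

static class EncodingRoundTripDemo
{
    // Mirrors the conversion chain from the question, with the "default"
    // encoding passed in explicitly (assumption: a legacy single-byte encoding).
    public static string RoundTrip(string s, Encoding legacy)
    {
        // Lossy step: characters the legacy encoding cannot represent become '?'.
        byte[] legacyBytes = legacy.GetBytes(s);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    static void Main()
    {
        // ASCII-only input comes back unchanged, so the chain does nothing useful.
        Console.WriteLine(RoundTrip("uxj ikfydk", Encoding.ASCII));

        // Real Devanagari input is destroyed: each character is replaced by '?'.
        Console.WriteLine(RoundTrip("नगर", Encoding.ASCII));
    }
}
```

In other words, the round trip never repairs text: if the string returned by the extractor is already wrong, the problem lies in how the text is stored in the PDF, not in the encoding of the .NET string.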

If it still does not work, then you have to install the specific font.

Md Kamruzzaman Sarker