I need to read a PDF file in my C# program. The file is in Persian. I use the code below. It works fine when the font is, for example, Tahoma, but it doesn't work when the font is a Persian one. How can I add Persian fonts to iTextSharp when reading a PDF?

An example of a Persian PDF: http://uplod.ir/idqrbqzzwl34/Visual_C__2005_Learning_(hashemian_).pdf.htm Persian text runs right to left, but when the text is extracted with iTextSharp it comes out left to right and is unreadable.

mansureh
  • Probably the PDFs with text in Persian fonts do not contain the information required for text extraction. Can you supply a sample PDF for inspection? – mkl May 28 '14 at 06:37
  • Do the fonts you want to embed in the document allow embedding in PDF? – DavidG May 28 '14 at 07:20
  • I don't know. Their names start with 'B', like Bnazanin – mansureh May 28 '14 at 07:46
  • I'm afraid I have no idea where to click on that upload.ir page to download the PDF. My blind clicks resulted in exe file downloads. – mkl May 28 '14 at 08:37
  • I uploaded it again: http://www.fileswap.com/dl/F8rBq711KP/ – mansureh May 28 '14 at 09:00
  • As Bruno already found out that iText(Sharp) can extract the text, you should really show your code. Obviously something seems to be amiss in it. – mkl May 28 '14 at 12:11

1 Answer

Your question is completely wrong and so is your comment to the other answer you received. You are assuming that extracted text has "a font". It hasn't. What you extract are bytes in a specific encoding (e.g. UTF-8).
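
To illustrate the point that extracted text is only encoded characters, here is a minimal sketch in C#. The class and method names are hypothetical and do not come from iTextSharp; the sketch only shows that a Persian string survives a round trip through UTF-8 bytes without any font being involved.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
static class EncodingSketch
{
    // Encode a string as UTF-8 bytes. No font information is stored.
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Decode UTF-8 bytes back into a string.
    public static string FromUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Which font is used to display the decoded string is decided later by whatever shows it (a console, a text box, a browser), not by the extraction step.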

Please watch this video: https://www.youtube.com/watch?v=wxGEEv7ibHE

Text content in a PDF is stored as a sequence of characters. These characters are mapped to glyphs. E.g. the character a can be mapped to a glyph "a" drawn in any typeface or style, or to any other glyph, including b or c. It's just "a code" that is used to find the instructions needed to draw the letter on the page.

What you need is another mapping. You need to find the mapping between the "character" that is used as a code in the content stream and the Unicode character it represents. There should be a ToUnicode mapping in your PDF, but... as you can see in the video mentioned above, not all PDFs have this mapping.
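
The idea of a code-to-Unicode mapping can be sketched as follows. This is only an illustration under stated assumptions: the table and the names `ToUnicodeSketch` and `Decode` are hypothetical, the codes are invented, and a real PDF stores the mapping in the font's ToUnicode CMap, which a library such as iTextSharp reads for you.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration, not iTextSharp's implementation.
static class ToUnicodeSketch
{
    // Invented content-stream codes mapped to the Unicode characters they represent.
    static readonly Dictionary<byte, char> CodeToUnicode = new Dictionary<byte, char>
    {
        { 0x01, '\u0633' }, // ARABIC LETTER SEEN
        { 0x02, '\u0644' }, // ARABIC LETTER LAM
        { 0x03, '\u0627' }, // ARABIC LETTER ALEF
        { 0x04, '\u0645' }  // ARABIC LETTER MEEM
    };

    // Translate codes to Unicode. Codes without a mapping become U+FFFD,
    // which is roughly what a PDF without a ToUnicode mapping gives you.
    public static string Decode(byte[] codes)
    {
        var result = new StringBuilder();
        foreach (byte code in codes)
        {
            char ch;
            result.Append(CodeToUnicode.TryGetValue(code, out ch) ? ch : '\uFFFD');
        }
        return result.ToString();
    }
}
```

With this table, the codes 1, 2, 3, 4 decode to the word سلام, while an unmapped code such as 9 decodes to the replacement character. When the mapping is missing for a whole font, no choice of font on the reading side can recover the text.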

The best way to check whether the text in your PDF can be extracted is to copy/paste text from Adobe Reader. If you succeed, you should be able to extract the text programmatically; if you don't, you need to start looking for an OCR solution.

Update: I have downloaded your PDF and I've extracted the text. I don't see what is missing. Unfortunately I can't copy/paste the text here because the body of an answer is limited to 30000 characters.

Cœur
Bruno Lowagie
  • Update your question if you want to share code. Adding code in a comment is not recommended. – Bruno Lowagie May 28 '14 at 09:38
  • @BrunoLowagie I've rolled back your last edit because it was not within the scope of a programming answer. – Cœur Aug 03 '18 at 02:19
  • @Cœur OK, the problem is that I don't understand Farsi, whereas the OP probably does. I couldn't check whether the result I got was correct because I don't understand what the text says. I'd have to compare the different Arabic glyphs one by one by sight. With the result shared, the OP could read it and detect any difference much faster than a person who doesn't understand Farsi. – Bruno Lowagie Aug 03 '18 at 06:14
  • @BrunoLowagie I understand you had good intentions, but Stack Overflow is not for individual support; it is for building Q&A that applies to many people. See [Stack Overflow is about building a Q&A library](https://meta.stackoverflow.com/questions/371796/if-stack-overflow-is-about-building-a-qa-library-how-to-communicate-and-uphol). Note that, in addition to your answer, you may give the extra personalized info that the asker is looking for; but in this case it was very long, so it's better to host it externally. Remember that a question can have 50 answers per page... – Cœur Aug 03 '18 at 06:22