PDFBox 2.0.8: some Unicode (Hindi) characters are not displayed correctly in the PDF file

Question

I'm using the PDFBox library to create a PDF file containing Unicode (Hindi) text. The file is created, but some characters are not displayed correctly. I am using the Mangal font, loaded from a TTF file:

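import java.io.File;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

File file = new File("D:\\mangal.ttf");   // the backslash must be escaped in a Java string literal
PDFont font = PDType0Font.load(pd, file); // pd is the PDDocument being written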

Original text

जीएसटी में कंपोजीशन स्कीम की आड़ में टैक्स चोरी करने वाले व्यापारियों की अब खैर नहीं। जीएसटी काउंसिल ऐसे व्यापारियों पर नकेल कसने के लिए ‘रिवर्स चार्ज मैकेनिज्म’ के प्रावधान को लागू करने की तैयारी कर रही है। बताया जाता है कि सबसे पहले यह विवादित प्रावधान कंपोजीशन स्कीम के डीलरों पर ही लागू किया जाएगा। बाद में दूसरे कारोबारी इसके दायरे में आएंगे। काउंसिल ने इस दिशा में कदम उठाते हुए एक मंत्रिसमूह का गठन किया है।

Generated text

One more thing: when I copy the text from the generated PDF file and paste it into MS Word, it is displayed correctly, as you can see in the text below:

जीएसटी में कंपोजीशन स्कीम की आड़ में टैक्स चोरी करने वाले व्यापारियों की अब खैर नहीं। जीएसटी काउंसिल ऐसे व्यापारियों पर नकेल कसने के लिए ‘रिवर्स चार्ज मैकेनिज्म’ के प्रावधान को लागू करने की तैयारी कर रही है। बताया जाता है कि सबसे पहले यह विवादित प्रावधान कंपोजीशन स्कीम के डीलरों पर ही लागू किया जाएगा। बाद में दूसरे कारोबारी इसके दायरे में आएंगे। काउंसिल ने इस दिशा में कदम उठाते हुए एक मंत्रिसमूह का गठन किया है।

It is very difficult for a non-Indian to identify what went wrong, because you highlighted the results but not the source, but I think that "टैक्स" is the first one. I suspect the cause is that PDFBox doesn't support complex scripts, i.e. replacing glyphs with other glyphs depending on context. When removing the last character in an editor, I get "टैक्" (removed 1 character), "टैक" (removed 2 characters), "टै" (removed 3), "ट" (removed 4). What I mean is that the glyph क wasn't there at the beginning, but it appears depending on context. — Tilman Hausherr, Mar 26 '18 at 08:37
Thanks for the reply, @TilmanHausherr. As you said, PDFBox does not support complex scripts, so what is an alternative way of doing the same? One more thing: the PDFBox documentation says that version 2.0 supports Unicode fonts (https://pdfbox.apache.org/2.0/migration.html), and I am using a Unicode font, so why are some characters not displayed correctly? My second point: when I copy the text from the PDF into any other editor, it is displayed correctly. — Manish Pandey, Mar 26 '18 at 10:03
There is a solution for Arabic https://stackoverflow.com/questions/48284888/writing-arabic-with-pdfbox-with-correct-characters-presentation-form-without-bei but I couldn't find anything similar for Hindi. — Tilman Hausherr, Mar 26 '18 at 10:10
Hi, I have tried the ICU library, but I didn't find any appropriate method for Hindi like the one for Arabic characters. Is there any other library or approach that I can try? — Manish Pandey, Mar 26 '18 at 13:39
According to this document, http://unicode.org/faq/indic.html, "Unicode provides a way to force the display engine to show a half letter form. To do this, an invisible character called ZERO WIDTH JOINER (\u200d) should be inserted after the virama", so I tried appending \u200d after the virama, but it does not work in my case. Why is it not working? — Manish Pandey, Mar 29 '18 at 08:02
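
The two points raised in the comments can be checked with plain Java string handling. The sketch below uses only the Java standard library; the class `DevanagariInspector` and its method names are hypothetical and are not part of PDFBox or ICU. It lists the code points of the word "टैक्स" and shows what inserting ZERO WIDTH JOINER after the virama actually does to the string. It does not fix the rendering. It only makes visible that the conjunct is not stored as a code point of its own, so the renderer has to substitute glyphs depending on context.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
class DevanagariInspector {

    static final int VIRAMA = 0x094D;            // DEVANAGARI SIGN VIRAMA
    static final int ZERO_WIDTH_JOINER = 0x200D; // ZERO WIDTH JOINER

    /** Lists each code point of the text as "U+XXXX NAME". */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
                out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }

    /** Returns a copy of the text with a ZWJ inserted after every virama. */
    static String insertZwjAfterVirama(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            sb.appendCodePoint(cp);
            if (cp == VIRAMA) {
                sb.appendCodePoint(ZERO_WIDTH_JOINER);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The word from the comment above, written with escapes:
        // TTA, VOWEL SIGN AI, KA, VIRAMA, SA
        String word = "\u091F\u0948\u0915\u094D\u0938";
        describe(word).forEach(System.out::println);
        System.out.println("--- with ZWJ after the virama ---");
        describe(insertZwjAfterVirama(word)).forEach(System.out::println);
    }
}
```

The first listing contains five code points (TTA, VOWEL SIGN AI, KA, VIRAMA, SA), which matches the character-by-character deletion described in the comment above. The ZWJ variant merely adds a sixth code point to the string. It is a hint to a shaping engine, so if the library does no contextual glyph substitution, as suggested in the comments, the extra character presumably cannot change the output.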
