Hindi to english from pdf

Question

I am not able to copy Hindi content from a PDF file. When I copy and paste the content, it changes to different Hindi characters.

Example:

Original: विधान सभा

After paste: नरधरन सभर

This is how it shows up after pasting.

Can anybody help me get the exact Hindi characters?

Try OCR solutions. There are many documents with misleading or missing text information, Hindi ones in particular. — mkl, Jun 06 '17 at 06:46

score 0 · Answer 1 · answered Jun 05 '17 at 21:55

What was used to create the PDF?

It was likely created with an embedded font subset and no toUnicode mapping. The character codes used in the PDF's content are mapped to glyphs embedded in the file, and those glyphs are what you see displayed, but there is no mapping from the codes back to regular Unicode code points, so copying the text produces gibberish. In that case, the only way to extract the original content is some form of OCR.
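To get a rough idea of whether this applies to your file, you can look for the `/ToUnicode` key (the PDF name for the mapping described above) in the raw bytes. This is only a crude heuristic sketch using the Python standard library: the helper names `count_tounicode_refs` and `looks_unmapped` are made up for illustration, and the scan only sees font dictionaries that are stored uncompressed. PDFs that pack their objects into compressed object streams need a real PDF parser instead.

```python
import re


def count_tounicode_refs(pdf_bytes: bytes) -> int:
    """Count literal /ToUnicode keys visible in the raw (uncompressed) bytes."""
    return len(re.findall(rb"/ToUnicode\b", pdf_bytes))


def looks_unmapped(pdf_bytes: bytes) -> bool:
    """Return True if font dictionaries are visible but none carries /ToUnicode.

    Heuristic only: a False result does not prove the text is extractable,
    because the mapping that is present may itself be wrong.
    """
    # \b keeps "/Type /FontDescriptor" from being counted as a font dictionary.
    has_fonts = re.search(rb"/Type\s*/Font\b", pdf_bytes) is not None
    return has_fonts and count_tounicode_refs(pdf_bytes) == 0


if __name__ == "__main__":
    import sys

    # Usage: python check_pdf.py your-file.pdf
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            data = fh.read()
        print("ToUnicode entries found:", count_tounicode_refs(data))
        print("Fonts without any visible mapping:", looks_unmapped(data))
```

If the check reports fonts without a mapping, copy/paste cannot work and OCR is the realistic option. If a mapping is present and the pasted text is still wrong, the mapping itself is probably incorrect, and OCR is again the way out.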

Another possibility is that the application you are pasting it into is not shaping the characters correctly.
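One way to tell the two cases apart is to look at the actual code points of the pasted text. The sketch below uses only the standard library; `describe` is a hypothetical helper name, and the two strings are the ones from the question. If the pasted text has the same code points as the expected text, the data is fine and the target application is rendering it badly. If the code points differ, the wrong characters came out of the PDF and no change of application or font will fix it.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


original = "विधान सभा"   # what the PDF displays
pasted = "नरधरन सभर"    # what arrives after copy/paste

# Print the two strings side by side and flag positions that differ.
for (code_o, name_o), (code_p, name_p) in zip(describe(original), describe(pasted)):
    flag = "  " if code_o == code_p else "!="
    print(f"{code_o} {name_o:<28} {flag} {code_p} {name_p}")
```

Any line flagged with `!=` is a position where the extracted character is genuinely a different letter, which points to the font mapping in the PDF and not to a shaping problem.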

"no toUnicode mapping" - or a misleading one, cf. https://stackoverflow.com/a/30804279/1729265 — mkl, Jun 06 '17 at 06:48
