
I am not able to copy Hindi content from a PDF file. When I try to copy/paste the content, it changes to different Hindi characters.

Example:

Original: निर्वाचक

After paste: ननरररचक

Can anybody help me get the exact Hindi characters?

Savendra Singh
  • Very often Hindi fonts are embedded with incorrect glyph-to-Unicode mappings. Applying OCR might be necessary. – mkl Jun 10 '15 at 13:16
  • It's impossible to help you in any way without seeing an actual PDF document showing this problem. – David van Driessche Jun 10 '15 at 16:24
  • Hello @SavendraSingh I am facing exactly same issue with a similar document. I need a favour from you. Can you share how did you resolve this issue. How did you read the document?? Your response will be really helpful to me.. – Viraj Nalawade Oct 29 '16 at 09:11
  • I solved this issue with OCR. I did complete voter data extraction for Karnataka. – Naveed Jun 07 '18 at 06:19

1 Answer


This issue is similar to the one discussed in this answer, and the appearance of the sample document there is also reminiscent of the document here:

In a nutshell

Your document itself provides the information that, e.g., the glyphs "निर्वाचक" in the headline represent the text "ननरररचक". You should ask the source of your document for a version in which the font information is not misleading. If that is not possible, you should go for OCR.

In detail

The top line of the first page is generated by the following operations in the page content stream:

/9 239 Tf
( !"#$%&) Tj 

The first line selects the font named 9 at a size of 239 (an operation at the beginning of the page scales everything down). The second line causes glyphs to be printed. These glyphs are referenced between the parentheses using the custom encoding of that font.

The font 9 on the first page of your PDF contains a ToUnicode map. This map especially maps

<20> <20> <0928>
<21> <21> <0928>
<22> <22> <0930>
<23> <23> <0930>
<24> <24> <0930> 

i.e. the codes 0x20 (' ') and 0x21 ('!') both map to the Unicode code point 0x0928 ('न') and the codes 0x22 ('"'), 0x23 ('#'), and 0x24 ('$') all to the Unicode code point 0x0930 ('र').

Thus, the content of ( !"#$%&), displayed as "निर्वाचक", is extracted / copied-and-pasted as "ननरररचक", which is entirely correct according to the information in the document.
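The mechanics above can be sketched in a few lines of plain Python, without any PDF library: apply the quoted ToUnicode entries to the bytes of the content-stream string. The entries for 0x25 and 0x26 are not shown in the excerpt above; they are inferred here from the extracted text "ननरररचक" and are therefore an assumption.

```python
# Quoted bfrange entries from the font's ToUnicode map, plus two
# entries inferred (assumption) from the extracted text.
to_unicode = {
    0x20: 0x0928,  # ' ' -> 'न' (quoted)
    0x21: 0x0928,  # '!' -> 'न' (quoted)
    0x22: 0x0930,  # '"' -> 'र' (quoted)
    0x23: 0x0930,  # '#' -> 'र' (quoted)
    0x24: 0x0930,  # '$' -> 'र' (quoted)
    0x25: 0x091A,  # '%' -> 'च' (inferred)
    0x26: 0x0915,  # '&' -> 'क' (inferred)
}

# The bytes between the parentheses of the ( !"#$%&) Tj instruction.
content_string = bytes([0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26])

# A text extractor simply looks each byte up in the ToUnicode map.
extracted = "".join(chr(to_unicode[b]) for b in content_string)
print(extracted)  # ननरररचक
```

This is why the copied text differs from what is displayed: the glyph outlines drawn on the page and the Unicode values claimed for them by the ToUnicode map are two independent pieces of information, and here the latter is simply wrong.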

mkl
  • I have more than 250 PDF files of the same type and I am not able to extract the given content; OCR is also not working correctly, as it misses many of the characters. – Savendra Singh Jun 12 '15 at 17:31
  • @mkl can you then please explain how to solve this issue? I have understood the problem you are describing, but how to solve it is still not clear. – proprius Oct 23 '15 at 13:49
  • @proprius *can you then please explain, how to solve this issue* - in [another answer](http://stackoverflow.com/a/31923094/1729265) it looked like there were actually only a few font dictionaries in each PDF. One solution would be to present each glyph of each of these fonts to a user who would then provide the correct Unicode character. From this information one can then build a **ToUnicode** map for each font object and replace the original one. – mkl Oct 23 '15 at 14:08
  • @proprius If you have many documents and the fonts in them are subsets of the same few actual full fonts, you can more and more automate this by recognizing glyphs the user has already mapped to Unicode and re-using that former input. – mkl Oct 23 '15 at 14:11
  • @proprius Creating this tool is a non-trivial project in its own right; the developer should know his way around PDF internals and font format internals. If you need to process very many such documents, though, the work may pay off. – mkl Oct 23 '15 at 14:29
  • I don't have many documents. In fact, I have to work on a similar document that other users (@SavendraSingh, @Rohit) have asked about. For now, a simple method would help. – proprius Oct 24 '15 at 01:23
  • @mkl : Could it be that, instead of `utf-8`, some other encoding can be used with this document type? – aspiring1 Sep 04 '19 at 06:37
  • @aspiring1 the font encodings used here are ad-hoc encodings (completely non-standard, single use only). The **ToUnicode** maps are meant to map that to Unicode, but they simply lie here. – mkl Sep 04 '19 at 07:54
  • @mkl : I would like to ask if there is a reason why such mapping problems exist in PDFs: human error, or is there a certain reasoning behind it? I find these in standard govt. docs too. – aspiring1 Sep 04 '19 at 08:01
  • *"I would like to ask , if there is a reason why such mapping problems are there in pdfs"* - I'm not sure but I'd guess that in the case at hand the PDF generator simply is deficient. There are cases, though, in which by design the mapping has been corrupted to make text extraction difficult, see for example [this answer](https://stackoverflow.com/a/22688775/1729265). – mkl Sep 04 '19 at 08:20
  • @mkl : How are (' ') mapped to 0x20, 0x21 ('!') then to the `unicode` code point 0x0928 ('न'), I have got the page content stream, but I only get the **fontmap to unicode mapping** such as `<01> <0020>` as you can [see](https://stackoverflow.com/q/57952798/8030107) here. – aspiring1 Sep 16 '19 at 09:04
  • @aspiring1 I don't really understand what your question is here. – mkl Sep 16 '19 at 13:10
  • @aspiring1 *"How are (' ') mapped to 0x20"* - in the answer above I posted an excerpt of the content stream. Strictly speaking there is no space character ' ' in the content stream but there is a byte 0x20 in the content stream. Merely for visualization I said that there is the instruction `( !"#$%&) Tj`, actually there is the instruction comprised of the byte sequence 0x28 0x20 0x21 0x22 0x23 0x24 0x25 0x26 0x29 0x20 0x54 0x6A. But here you immediately have the hex values mapped by the **ToUnicode** map. – mkl Sep 18 '19 at 10:15
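The repair mkl suggests in the comments above (have a user identify the Unicode text each glyph code really represents, then build a replacement ToUnicode map for the font object) could be sketched as follows. The sample mapping is illustrative only, and embedding the resulting CMap stream back into the PDF with a PDF library is a separate step not shown here.

```python
def build_tounicode_cmap(mapping):
    """Serialize a glyph-code -> Unicode-text mapping as a ToUnicode CMap.

    mapping: dict of glyph code (int, single byte) to Unicode string;
    a value may contain several code points, e.g. a conjunct like 'र्'.
    """
    entries = []
    for code, text in sorted(mapping.items()):
        # ToUnicode values are UTF-16BE hex strings.
        utf16 = "".join(f"{b:02X}" for b in text.encode("utf-16-be"))
        entries.append(f"<{code:02X}> <{utf16}>")
    return (
        "/CIDInit /ProcSet findresource begin\n"
        "12 dict begin begincmap\n"
        "/CMapName /Adobe-Identity-UCS def\n"
        "/CMapType 2 def\n"
        "1 begincodespacerange <00> <FF> endcodespacerange\n"
        f"{len(entries)} beginbfchar\n"
        + "\n".join(entries) + "\n"
        "endbfchar\n"
        "endcmap CMapName currentdict /CMap defineresource pop end end"
    )

# Illustrative (assumed) corrections for the first glyph codes of the
# headline font: the glyphs really show न, ि, and the conjunct र्.
cmap = build_tounicode_cmap({0x20: "न", 0x21: "ि", 0x22: "र्"})
print(cmap)
```

With a replacement map like this installed for each font object, copy/paste and text extraction would yield the displayed text instead of the misleading values the original ToUnicode map claims.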