I have a searchable PDF in Hindi. Example: https://www.ceorajasthan.nic.in/erolls/pdf/dper-18/A151/A151001.pdf

I also have the font file for it (download link: http://ceorajasthan.nic.in/erolls/pdf/Forms/mfdev010.ttf).

I want to use this font so that, wherever a glyph comes out garbled, I can map it to the correct character.

I have managed to get the glyph file, which has lines like this:

(Abc.glf) 131|0xc1|Aacute|00c1|400b0234121005002502120526002b35012b35|
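
As a minimal sketch, a record like the one above could be split into fields as follows. The field names (`index`, `code`, `name`, `unicode_hex`, `data`) are guesses based on this single sample line, not a documented format, and the leading `(Abc.glf)` is assumed to be only the file name, so it is left out of the parsed string.

```python
from collections import namedtuple

# Field names are assumptions inferred from one sample line of the .glf file.
GlyphRecord = namedtuple("GlyphRecord", "index code name unicode_hex data")


def parse_glf_line(line):
    """Split one pipe-delimited glyph record into named fields."""
    parts = line.strip().split("|")
    # The sample ends with a trailing "|", which produces one empty last field.
    if parts and parts[-1] == "":
        parts.pop()
    if len(parts) != 5:
        raise ValueError("expected 5 fields, got %d" % len(parts))
    index, code, name, unicode_hex, data = parts
    # int(..., 16) accepts the "0x" prefix used in the second field.
    return GlyphRecord(int(index), int(code, 16), name, unicode_hex.lower(), data)


record = parse_glf_line(
    "131|0xc1|Aacute|00c1|400b0234121005002502120526002b35012b35|"
)
print("%s %d %s" % (record.name, record.code, record.unicode_hex))
```

Note that the sample record names a Latin glyph (`Aacute`), so the name and code point in this file may not describe the Devanagari shape the font actually draws at that position.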

The PDF is in Devanagari script, and I have not been able to make any further progress. Any help would be appreciated.

I am using Python 2.7.

aryan singh
  • What is the overall issue? That PDF appears fine, with embedded subsetted fonts and what looks to be correct Unicode mapping (although I cannot read Hindi). Also, is this the only file, or do you have many like this? Who made them? – Ryan Jul 16 '18 at 18:01
  • The first word in the PDF is िनवार्चक, but when I copy and paste it or run pdftotext, it comes out as ननरररचक, mainly because some mappings are wrong or missing. I want to fix the PDF. What should I do? – aryan singh Jul 18 '18 at 10:18
  • Who created the PDF? Do you just have this one, or lots like this? These issues are best resolved at the point of creation. Trying to repair afterwards is very difficult. – Ryan Jul 18 '18 at 16:40
  • @Ryan This reminds me very much of the issue explained in [this answer](https://stackoverflow.com/a/30804279/1729265). Essentially, a **ToUnicode** map which is partially incorrect. – mkl Aug 12 '18 at 07:20
  • Yes, Aryan, you should see the answer mentioned by @mkl. That might work for your file(s). If not, then you need to post the file here for any further assistance. – Ryan Aug 12 '18 at 21:52
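
To illustrate what a partially incorrect **ToUnicode** map means in practice, here is a minimal sketch that parses the `bfchar` entries of a CMap fragment, overrides selected entries, and writes the section back out. The CMap text, the glyph codes, and the correction shown are all hypothetical and not taken from the PDF in question. The sketch ignores `bfrange` sections and does not read from or write to a PDF, which would need a PDF library.

```python
import re

# Hypothetical ToUnicode fragment; the real one lives in a stream inside the PDF.
SAMPLE_CMAP = """\
2 beginbfchar
<0001> <0928>
<0002> <0928>
endbfchar
"""

PAIR_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {glyph code: Unicode value as a hex string} for all bfchar pairs."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in PAIR_RE.findall(block):
            mapping[int(src, 16)] = dst.upper()
    return mapping


def apply_corrections(mapping, corrections):
    """Return a copy of the mapping with the given entries overridden."""
    fixed = dict(mapping)
    fixed.update(corrections)
    return fixed


def to_bfchar(mapping):
    """Serialise the mapping back into a single bfchar section."""
    lines = ["%d beginbfchar" % len(mapping)]
    for code in sorted(mapping):
        lines.append("<%04X> <%s>" % (code, mapping[code]))
    lines.append("endbfchar")
    return "\n".join(lines)


mapping = parse_bfchar(SAMPLE_CMAP)
# Assumed correction: glyph 0x0002 should be U+093F (vowel sign I), not U+0928 (NA).
fixed = apply_corrections(mapping, {0x0002: "093F"})
print(to_bfchar(fixed))
```

Working out which glyph codes are wrong, and what they should map to, would still have to be done by comparing the rendered glyphs with the extracted text.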

0 Answers