Overwriting the ToUnicode Map stream in a PDF

Asked Mar 21 '20 at 08:38


In this question, mkl provides a fantastic answer to pnj's predicament. We are unfortunately facing a very similar issue (with a different Devanagari font, called Lohit - Devanagari). The second comment outlines the steps of the non-OCR solution beautifully, but I have a large gap in my understanding of PDFs and their structure. It would be great to get some direction on the following two steps from that comment:

"overwrite the ToUnicode map in this PDF using a general purpose PDF library with a low-level object access API for a programming language of your choice": Which Python library can I use to do this?
"traversing the PDF object structure, finding the ToUnicode map stream, replacing its content, and saving the result": Is there an example that shows exactly how this is done, for any font?

I hope this isn't too broad. Thank you!
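
For reference, here is a rough sketch of what the two quoted steps might look like in Python. It rests on several assumptions that do not come from the linked answer: it uses the third-party pikepdf library as the low-level object API, it assumes the font dictionaries are indirect objects whose /BaseFont contains a known substring, and it assumes 2-byte character codes (typical for Identity-H composite fonts). The function names `build_tounicode_cmap` and `overwrite_tounicode` are hypothetical helpers, and the correct glyph-code-to-Unicode mapping still has to be worked out by hand for the font in question.

```python
from typing import Dict


def build_tounicode_cmap(mapping: Dict[int, str], code_bytes: int = 2) -> bytes:
    """Build the body of a ToUnicode CMap stream from {character code: text}.

    code_bytes is the width of a character code in the content stream
    (assumed 2 here; simple fonts normally use 1).
    """
    width = code_bytes * 2  # number of hex digits per source code
    max_code = (1 << (8 * code_bytes)) - 1
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        f"<{0:0{width}X}> <{max_code:0{width}X}>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # A single bfchar section may hold at most 100 entries, so split into chunks.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination strings are UTF-16BE, written as hex.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:0{width}X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines).encode("ascii")


def overwrite_tounicode(src: str, dst: str, font_name_part: str,
                        mapping: Dict[int, str], code_bytes: int = 2) -> int:
    """Replace the ToUnicode stream of every matching font; return the count."""
    import pikepdf  # third-party; imported lazily so the CMap builder works without it

    cmap = build_tounicode_cmap(mapping, code_bytes)
    replaced = 0
    with pikepdf.open(src) as pdf:
        # Walk all indirect objects and pick out font dictionaries.
        # Assumption: fonts are indirect objects, not inlined in /Resources.
        for obj in pdf.objects:
            if not isinstance(obj, pikepdf.Dictionary):
                continue
            if obj.get("/Type") != pikepdf.Name("/Font"):
                continue
            if font_name_part not in str(obj.get("/BaseFont", "")):
                continue
            if "/ToUnicode" in obj:
                # write() replaces the stream data and stores it unfiltered.
                obj["/ToUnicode"].write(cmap)
                replaced += 1
        pdf.save(dst)
    return replaced
```

A call might look like `overwrite_tounicode("in.pdf", "out.pdf", "Lohit", {0x0024: "\u0915"})`, where the file names and the code 0x0024 are made up for illustration. The return value says how many font dictionaries were touched, which is a quick check that the /BaseFont substring actually matched something.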

asked Mar 21 '20 at 08:38

wireman
