In this question, mkl provides a fantastic answer to pnj's predicament. Unfortunately, we are facing a very similar issue (with a different font, Lohit - Devanagari, but still a Devanagari font). The second comment outlines the steps of the non-OCR solution beautifully, but I have a large gap in my understanding of PDFs and their structure. I would therefore appreciate some direction on the two steps quoted below:
- "overwrite the ToUnicode map in this PDF using a general purpose PDF library with a low-level object access API for a programming language of your choice": Which Python library can I use to do this?
- "traversing the PDF object structure, finding the ToUnicode map stream, replacing its content, and saving the result": Is there an example that shows exactly how this is done, for any font at all?
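To make the second point concrete, here is my rough understanding of what the replacement stream content would look like. It is only a minimal sketch: the function name `build_tounicode_cmap`, the two-byte character codes, and the sample mapping are my own assumptions for illustration, not taken from mkl's answer or from our actual font.

```python
def build_tounicode_cmap(mapping, code_bytes=2):
    """Build the text of a ToUnicode CMap from {character code: Unicode string}.

    Hypothetical helper for illustration; assumes fixed-width character codes
    of `code_bytes` bytes each (2 is typical for Identity-H encoded fonts).
    """
    def hex_code(code):
        # Character code as zero-padded hex, e.g. 0x24 -> "0024" for 2 bytes.
        return format(code, "0{}X".format(code_bytes * 2))

    def hex_text(text):
        # Target text is written as UTF-16BE in hex.
        return text.encode("utf-16-be").hex().upper()

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<{}> <{}>".format("00" * code_bytes, "FF" * code_bytes),
        "endcodespacerange",
    ]

    items = sorted(mapping.items())
    # A single bfchar section may hold at most 100 entries, so split in chunks.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append("{} beginbfchar".format(len(chunk)))
        for code, text in chunk:
            lines.append("<{}> <{}>".format(hex_code(code), hex_text(text)))
        lines.append("endbfchar")

    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines).encode("ascii")


# Made-up sample mapping: code 0x0003 -> space, code 0x0024 -> DEVANAGARI LETTER KA.
cmap_bytes = build_tounicode_cmap({0x0003: " ", 0x0024: "\u0915"})
```

What I am missing is the surrounding part: which library lets me walk from each page's resources to the font dictionary, locate its /ToUnicode stream, replace the stream content with bytes like `cmap_bytes`, and save the file. I have seen pikepdf mentioned as offering this kind of low-level object access, but I have not verified that it is the right tool.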
I hope this isn't too broad. Thank you!