
I am using Camelot to parse budget documents released by different states in India. Parsing runs without errors, but the extracted Devanagari text (languages such as Hindi, Marathi, etc.) differs from the text in the document. The input file is on this link, and the output file after parsing is on this link. As can be seen, the Devanagari characters in the output don't correspond to those in the input file. An MWE is shown below.

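import camelot
# Parse the PDF with the 'stream' flavor, which detects tables from whitespace
tables = camelot.read_pdf('Demand_ Estimate.pdf', flavor='stream')
# Export the first detected table to CSV
tables[0].to_csv('demand_estimate.csv')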
pseudomonas
  • This is fairly common for PDFs in Indian languages, see e.g. [this question](https://stackoverflow.com/q/35917848/1729265) and other questions linked from there. – mkl Oct 14 '19 at 22:18
  • Inspection of your example PDF shows that the issue at hand indeed is the same as in those duplicated questions - the **ToUnicode** tables of the fonts in it map multiple, different looking glyphs to the same Unicode code point. Thus, text extraction (which relies on those tables) will always return such broken results. As a test you can apply simple copy&paste from Adobe Reader which also in your case returns the same broken results. Unless you try to implement your very own text extractor (which tries to rely on other, usually otherwise meaningless information), you have to try OCR. – mkl Oct 15 '19 at 10:19
  • For anyone who is interested, we managed to find a workaround by converting the PDF to an image and then obtaining the CSV file from that image. It might or might not work depending on the document. For us, it works for most of our documents. – pseudomonas Oct 26 '19 at 11:00
  • *"converting the pdf to an image and then obtain the csv file"* - That *obtaining* would be by means of OCR, I presume. – mkl Oct 28 '19 at 11:53
  • Yes. I should have added that. We used Tesseract to do that. – pseudomonas Oct 28 '19 at 11:54
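
The workaround described in the comments above (render each PDF page to an image, then OCR it with Tesseract) could look roughly like the sketch below. It is not the commenters' actual code. It assumes the `pdf2image` and `pytesseract` packages, a Poppler install, and Tesseract with the Hindi (`hin`) language data; use `mar` for Marathi. The function names `ocr_pdf_pages`, `text_to_rows`, and `write_csv` are hypothetical, and splitting columns on runs of two or more spaces is a simple heuristic that will not suit every table layout.

```python
import csv
import re


def ocr_pdf_pages(pdf_path, lang="hin", dpi=300):
    """Rasterize each PDF page and OCR it; returns one text string per page."""
    # Third-party imports are kept local so the helpers below work without them.
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract  # needs the Tesseract binary plus language data

    pages = convert_from_path(pdf_path, dpi=dpi)
    # Assume one uniform text block per page and keep the spacing between
    # words, so that column gaps survive as runs of spaces.
    config = "--psm 6 -c preserve_interword_spaces=1"
    return [
        pytesseract.image_to_string(page, lang=lang, config=config)
        for page in pages
    ]


def text_to_rows(text):
    """Split OCR text into rows; 2+ whitespace characters or a tab end a cell."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def write_csv(rows, csv_path):
    """Write the rows as UTF-8 CSV so Devanagari text is preserved."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

To use it, call `ocr_pdf_pages('Demand_ Estimate.pdf')`, pass each page's text through `text_to_rows`, and save the result with `write_csv`. OCR sidesteps the broken ToUnicode tables because it reads the rendered glyphs rather than the embedded text, but its output still needs spot-checking against the source document.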
