I want to scrape a Hindi (Indian language) PDF file with Python

Question

I have written Python code that extracts all the text from the PDF file. The problem is that once it is extracted, the words lose their grammar. How can I fix this problem? I am attaching the code.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO  # Python 2 only

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # an empty set means all pages

        # Feed every page through the interpreter; the text accumulates in retstr.
        for page in PDFPage.get_pages(fp, pagenos, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()
    device.close()
    retstr.close()
    return text

print convert_pdf_to_txt("S24A276P001.pdf")

And here is a screenshot of the PDF.

If you copy and paste from a PDF viewer, do you see the correct text? — YOU, Mar 15 '16 at 01:58
Sometimes the text in a PDF is not stored as-is for certain languages, as I have seen before. That means you need to write a custom decoder for it. Without knowing the language, there is not much I can do here. — YOU, Mar 15 '16 at 02:09
How do I write a general custom decoder? If you can help me with that, maybe I can figure out a way. — Abhinav Mishra, Mar 15 '16 at 02:12
To write a decoder, you need to understand the language and its grammar, which I don't speak. Maybe you can post sets of correct and incorrect texts, but there is a good chance that I won't have a clue. — YOU, Mar 15 '16 at 02:19
नाम gets changed into नपम, and राम changes to रपम. — Abhinav Mishra, Mar 15 '16 at 02:39
Only one character changes: u"\u0928\u093e\u092e" becomes u"\u0928\u092a\u092e", i.e. \u093e changes to \u092a. So changing \u092a back to \u093e, or \u092a\u092e to \u093e\u092e, might solve that particular case. — YOU, Mar 15 '16 at 05:33
Can you please elaborate? I didn't get you. And what would be the logic behind the decoder? — Abhinav Mishra, Mar 15 '16 at 07:12
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/106322/discussion-between-abhinav-mishra-and-you). — Abhinav Mishra, Mar 15 '16 at 07:15
Decoding is converting one set of characters to another. In this case, it is just a replace function, e.g. `text = text.replace(u"\u092a\u092e", u"\u093e\u092e")` — YOU, Mar 15 '16 at 09:46
@YOU: What if I have dynamic Hindi (Indian) words in the PDF? What is the best way to extract them? Even if we copy the text from the PDF and paste it into another document, it has the problem. Can you please guide me on this? Thanks — Niks Jain, Nov 22 '17 at 17:10
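
The replace-based "decoder" suggested in the comments above could be sketched roughly as follows. The table name `REPLACEMENTS` and the function `fix_extracted_text` are hypothetical, and the only mapping that comes from this thread is `\u092a\u092e` to `\u093e\u092e`; any other PDF or font would need its own entries, found by comparing garbled output against the correct text.

```python
# Hypothetical correction table: each key is a garbled sequence seen in the
# extracted text, each value is the sequence it should have been. Only this
# single pair comes from the thread (नपम should read नाम).
REPLACEMENTS = {
    u"\u092a\u092e": u"\u093e\u092e",
}

def fix_extracted_text(text, replacements=REPLACEMENTS):
    """Apply every replacement, trying the longest garbled sequences first."""
    for wrong in sorted(replacements, key=len, reverse=True):
        text = text.replace(wrong, replacements[wrong])
    return text

fixed = fix_extracted_text(u"\u0928\u092a\u092e")  # garbled नपम
print(fixed == u"\u0928\u093e\u092e")              # compare against नाम
```

A blind replacement like this also rewrites words that legitimately contain the "wrong" sequence, so it is probably only safe for mappings that never occur in correct text.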

score 4 · Accepted Answer · answered Mar 21 '16 at 20:09

The best way to solve the problem is to use the textract module for Python, load the Hindi test data from its GitHub repository, and write the extracted text to a .txt file. This solved my problem.
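
The answer gives no code, so the following is only a sketch of what it might look like, not the author's actual script. It assumes textract's Tesseract OCR backend (`method="tesseract"`) with the Hindi language code `hin`, which requires Tesseract and its Hindi language data to be installed separately. The function names `extract_with_textract` and `pdf_to_txt` and the output file name are hypothetical.

```python
import io

def extract_with_textract(pdf_path):
    """OCR the PDF via textract's Tesseract backend (an assumed setup)."""
    import textract  # imported lazily so the sketch loads without textract
    raw = textract.process(pdf_path, method="tesseract", language="hin")
    return raw.decode("utf-8")  # textract.process returns a byte string

def pdf_to_txt(pdf_path, txt_path, extractor=extract_with_textract):
    """Extract text from pdf_path and save it to txt_path as UTF-8."""
    text = extractor(pdf_path)
    with io.open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text

# Usage (needs textract, Tesseract, and its Hindi language data installed):
# pdf_to_txt("S24A276P001.pdf", "S24A276P001.txt")
```

Because OCR reads the rendered page rather than the PDF's internal character codes, it may sidestep the font-mapping problem described in the question, at the cost of occasional recognition errors.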

Abhinav Mishra

Could you please elaborate on the solution? A simple example would help us. Thanks – Niks Jain Nov 22 '17 at 15:04
