
I have written Python code that scrapes all the text from a PDF file. The problem is that once the text is scraped, the words lose their grammar. How can I fix this problem? I am attaching the code.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO  # Python 2 only


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # empty set: process every page

        for page in PDFPage.get_pages(fp, pagenos, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()
    device.close()
    retstr.close()
    return text


print convert_pdf_to_txt("S24A276P001.pdf")

And here is a screenshot of the PDF. (PDF screenshot)

ml-moron
  • If you copy and paste from a PDF viewer, do you see the correct text? – YOU Mar 15 '16 at 01:58
  • No, the text is not correct after I copy and paste it. – Abhinav Mishra Mar 15 '16 at 01:59
  • Sometimes text is not stored as-is in PDF files for some languages, as I have seen before; that means you need to write a custom decoder for it. Without knowledge of the language, there is not much I can do here. – YOU Mar 15 '16 at 02:09
  • How do I write a general custom decoder? If you can help me with that, maybe I can figure out a way. – Abhinav Mishra Mar 15 '16 at 02:12
  • To write a decoder, I would need to understand the language and grammar, which I don't speak. Maybe you can post sets of correct and incorrect texts, but there is a good chance that I won't have a clue. – YOU Mar 15 '16 at 02:19
  • नाम gets changed into नपम, and राम changes to रपम. – Abhinav Mishra Mar 15 '16 at 02:39
  • Only one character changes: u"\u0928\u093e\u092e" becomes u"\u0928\u092a\u092e", that is, \u093e changes to \u092a. So changing \u092a back to \u093e, or \u092a\u092e to \u093e\u092e, may solve that particular case. – YOU Mar 15 '16 at 05:33
  • Can you please elaborate? I didn't get you. And what would be the logic behind the decoder? – Abhinav Mishra Mar 15 '16 at 07:12
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/106322/discussion-between-abhinav-mishra-and-you). – Abhinav Mishra Mar 15 '16 at 07:15
  • Decoding is converting one set of characters to another; in this case it is just a replace function, e.g. `text = text.replace(u"\u092a\u092e", u"\u093e\u092e")` – YOU Mar 15 '16 at 09:46
  • @YOU: What if I have dynamic Hindi (Indian) words in the PDF? What is the best way to extract them? Even if we copy this text from the PDF and paste it into another document, it has the same problem. Can you please guide me on this? Thanks – Niks Jain Nov 22 '17 at 17:10
  • @AbhinavMishra do you have code to convert Hindi text? – shiv shankar keshari Jan 20 '21 at 19:04
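
The replace-based decoding suggested in the comments above can be sketched roughly as follows. This is only a minimal illustration: the `REPLACEMENTS` table and the `fix_extracted_text` name are hypothetical, and the single mapping comes from the one example pair given in the comments (नपम for नाम). A real PDF would likely need more pairs, collected by comparing extracted text against the correct text.

```python
# -*- coding: utf-8 -*-

# Hypothetical mapping table: (wrong sequence, intended sequence).
# Only the pair mentioned in the comments is listed; others are unknown.
REPLACEMENTS = [
    (u"\u092a\u092e", u"\u093e\u092e"),  # "पम" -> "ाम"
]


def fix_extracted_text(text, replacements=REPLACEMENTS):
    """Apply each (wrong, right) substitution to the extracted text in order."""
    for wrong, right in replacements:
        text = text.replace(wrong, right)
    return text


# Example: repair the two words quoted in the comments.
fixed_naam = fix_extracted_text(u"\u0928\u092a\u092e")  # नपम
fixed_raam = fix_extracted_text(u"\u0930\u092a\u092e")  # रपम
```

Note that a blind replacement like this also rewrites words that legitimately contain पम, so it only suits cases where the wrong sequence never occurs in correct text.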

1 Answer


The best way to solve the problem is to use the textract module for Python, load the Hindi test data from its GitHub repository, and write the extracted text to a .txt file. This solved my problem.
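
A rough sketch of what this might look like is below. It rests on assumptions not stated in the answer: that the Hindi data is meant for OCR through Tesseract (hence `method="tesseract"` and `language="hin"`), that Tesseract and its Hindi language data are already installed, and that the output file name is up to you. The helper names `extract_hindi_text` and `save_text` are hypothetical.

```python
import io


def extract_hindi_text(pdf_path):
    """Extract text from a PDF via textract, assuming Tesseract OCR with Hindi data."""
    # Imported here so the helpers below can be used without textract installed.
    import textract

    # textract.process returns bytes; the "hin" language code is an assumption.
    raw = textract.process(pdf_path, method="tesseract", language="hin")
    return raw.decode("utf-8")


def save_text(text, out_path):
    """Write the extracted text to a UTF-8 encoded .txt file."""
    with io.open(out_path, "w", encoding="utf-8") as fh:
        fh.write(text)


# Usage (file names are placeholders):
# save_text(extract_hindi_text("S24A276P001.pdf"), "output.txt")
```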