How can I convert a PDF to a text file in Unicode (UTF-8) format using PyPdf in Python?

# finally, write "output" to document-output.txt
outputStream = open("document-output.txt", "wb")
output.write(outputStream)
outputStream.close()
Htet

1 Answer

You can use PyPDF2, but I use pdfminer, which lets me get the text as Unicode (UTF-8). Please refer to the following code, which returns the text of a PDF as a string.

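from io import StringIO

# Imports assume the pdfminer.six package layout
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(file_path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text in memory
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pagenos = set()  # an empty set means "process every page"
    with open(file_path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, check_extractable=True):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()  # renamed from "str" to avoid shadowing the built-in
    retstr.close()
    return text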
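
To finish the job the question asks about, the returned string still has to be written to disk as UTF-8. A minimal sketch of that step, assuming Python 3; the helper name save_text_as_utf8 and the output file name are only illustrative and not part of pdfminer:

```python
def save_text_as_utf8(text, out_path):
    """Write a Unicode string to out_path, encoded as UTF-8."""
    # Open in text mode with an explicit encoding so the result does not
    # depend on the platform default; newline="" leaves line endings as-is.
    with open(out_path, "w", encoding="utf-8", newline="") as out_file:
        out_file.write(text)
```

Passing the string returned by convert_pdf_to_txt together with a path such as "document-output.txt" produces the UTF-8 text file. Stating the encoding explicitly matters mostly on systems where the default encoding is not UTF-8.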
Dharman
Brajesh