How to convert Pdf to Text with Unicode (utf-8) format using PyPdf

Question

How can I convert a PDF to a text file in Unicode (UTF-8) format using PyPdf in Python?

# finally, write "output" to document-output.txt
outputStream = open("document-output.txt", "wb")
output.write(outputStream)
outputStream.close()

possible duplicate of [Python module for converting PDF to text](http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text) — dyoo, Jan 26 '15 at 02:27
I used PyPDF2. It works for normal text, but not for Unicode. — Nurul Akter Towhid, Oct 11 '16 at 01:22

score 0 · Answer 1 · edited Feb 22 '22 at 17:29

You can use PyPDF2. I use pdfminer, which lets me save the text in Unicode (UTF-8). Please refer to the following code.

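# Imports assume Python 3 with pdfminer.six installed
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(self):
   # Method of a class that stores the PDF location in self.file_path
   rsrcmgr = PDFResourceManager()
   retstr = StringIO()  # in-memory buffer that collects the extracted text
   codec = 'utf-8'
   laparams = LAParams()
   device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
   fp = open(self.file_path, 'rb')
   interpreter = PDFPageInterpreter(rsrcmgr, device)
   pagenos = set()  # an empty set means every page is processed
   for page in PDFPage.get_pages(fp, pagenos, check_extractable=True):
       interpreter.process_page(page)
   fp.close()
   device.close()
   text = retstr.getvalue()  # named "text" to avoid shadowing the built-in str
   retstr.close()
   return text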
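
The function above only returns the extracted text; it does not write the file the question asks for. A minimal sketch of that last step follows, assuming Python 3 and that the returned value is an ordinary Unicode string. The helper name save_text_as_utf8, the sample string, and the temporary output path are made up for illustration and are not part of pdfminer or PyPDF2.

```python
import os
import tempfile


def save_text_as_utf8(text, out_path):
    """Write a Unicode string to out_path, encoded as UTF-8 (hypothetical helper)."""
    # Passing encoding explicitly avoids depending on the platform default.
    # newline="" writes line endings exactly as they appear in the string.
    with open(out_path, "w", encoding="utf-8", newline="") as out_file:
        out_file.write(text)


# Stand-in for the string returned by convert_pdf_to_txt()
sample_text = "Unicode sample: \u00e9 \u09ac\u09be\u0982\u09b2\u09be \u4e2d\u6587\n"
demo_path = os.path.join(tempfile.mkdtemp(), "document-output.txt")
save_text_as_utf8(sample_text, demo_path)
```

In real use, the string returned by convert_pdf_to_txt would be passed in place of sample_text, and the output path would be wherever the text file should go.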
