I wrote some code that takes a PDF document and extracts its text:
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
output_string = StringIO()
with open("C:\\Users\\document.pdf", 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
print(output_string.getvalue())
It works fine, but I want to automatically save the output as a .txt file. I tried importing sys and writing sys.stdout = open('document.test.txt','wt') just before the last line (the one with print), but I get:
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-8-4a73089c6ebb> in <module>
18 interpreter.process_page(page)
19 sys.stdout = open('document.test.txt','wt')
---> 20 print(output_string.getvalue())
~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
17 class IncrementalEncoder(codecs.IncrementalEncoder):
18 def encode(self, input, final=False):
---> 19 return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20
21 class IncrementalDecoder(codecs.IncrementalDecoder):
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 1614: character maps to <undefined>
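The failing step can be isolated without pdfminer. The sketch below only checks whether the character named in the traceback, '\u2192', can be encoded with cp1252 (the codec shown in the traceback) and with UTF-8. The helper name can_encode is hypothetical and exists only for this illustration.

```python
# U+2192 (RIGHTWARDS ARROW) is the character reported in the traceback.
arrow = "\u2192"


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 is the codec named in the traceback; utf-8 is shown for comparison.
cp1252_ok = can_encode(arrow, "cp1252")
utf8_ok = can_encode(arrow, "utf-8")
print("cp1252:", cp1252_ok, "utf-8:", utf8_ok)
```

This suggests the error comes from the encoding of the file that sys.stdout was pointed at, not from the extraction itself.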
I don't think the problem is in the text itself, since printing to the console works. I suspect that redirecting sys.stdout is the wrong way to save the output.
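A minimal sketch of one possible way to save the text, assuming the cause is that open() without an encoding argument falls back to the platform default (cp1252 here, according to the traceback). It writes the string to a file opened explicitly as UTF-8 and leaves sys.stdout alone. The function save_text, the sample string, and the temporary output path are hypothetical stand-ins; in the script above, the text would come from output_string.getvalue().

```python
import os
import tempfile


def save_text(text: str, path: str) -> None:
    """Write `text` to `path` as UTF-8 instead of the platform default encoding."""
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(text)


# Stand-in for output_string.getvalue(); it contains the character from the traceback.
extracted = "a \u2192 b\n"

# Hypothetical output location, chosen so the sketch runs anywhere.
out_path = os.path.join(tempfile.gettempdir(), "document.test.txt")
save_text(extracted, out_path)

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as check:
    round_trip = check.read()
```

Any program that later reads the file also needs to open it as UTF-8, otherwise the arrow will show up as garbage characters.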