I wrote code that takes a PDF document and extracts its text:

from io import StringIO 

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Collect the extracted text in an in-memory buffer
output_string = StringIO()
with open("C:\\Users\\document.pdf", 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # TextConverter writes each page's text into output_string
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

It works fine, but I want to automatically save the output as a .txt file. I tried importing sys and adding sys.stdout = open('document.test.txt','wt') just before the last line (the print call), but I get:

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-8-4a73089c6ebb> in <module>
     18         interpreter.process_page(page)
     19 sys.stdout = open('document.test.txt','wt')
---> 20 print(output_string.getvalue())

~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 1614: character maps to <undefined>

I don't think the problem is in the text itself, since printing it to the console works. I suspect this is the wrong way to save the output.

n.mathfreak
  • Does this answer your question? [UnicodeEncodeError: 'charmap' codec can't encode characters](https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters) – Michael Ruth May 31 '21 at 16:46
  • You don't need to reassign stdout. `print(output_string.getvalue(), file=f)` will send the output to the file, but you still have the issue that your text contains extended characters. U+2192 is a right-facing arrow. Perhaps you need to open your file as UTF-8. – Tim Roberts May 31 '21 at 19:18
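
Following the last comment, here is a minimal sketch of writing the extracted text straight to a file opened as UTF-8, instead of reassigning `sys.stdout`. The traceback shows the file was opened with the cp1252 codec, which has no mapping for U+2192; passing `encoding="utf-8"` to `open` avoids that. The helper name `save_text` is hypothetical (it is not part of pdfminer), and it assumes the text has already been collected, for example via `output_string.getvalue()`.

```python
def save_text(text: str, path: str) -> None:
    """Write text to path as UTF-8 instead of the platform default encoding.

    Hypothetical helper for illustration; not part of pdfminer.
    """
    # An explicit encoding means characters such as U+2192 can be written
    # even when the locale default (cp1252 in the traceback) cannot encode them.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

With the code from the question, the final `print(...)` line could then become `save_text(output_string.getvalue(), "document.test.txt")`. Using `print(output_string.getvalue(), file=f)` inside the same `with` block should work equally well; the important part is the explicit `encoding="utf-8"`.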
