I am trying to read a .docx file and copy its text to a .txt file.

I started with the following script:

if extension == 'docx':
    document = Document(filepath)
    for para in document.paragraphs:
        with open("C:/Users/prasu/Desktop/PySumm-resource/CodeSamples/output.txt", "w") as file:
            file.writelines(para.text)

It fails with the following error:

Traceback (most recent call last):
  File "input_script.py", line 27, in <module>
    file.writelines(para.text)
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2265' in position 0: character maps to <undefined>

Printing para.text with print() works. How can I write para.text to a .txt file?

1 Answer

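You could try the codecs module.

Based on your error message, the character "≥" (U+2265) is causing the problem. The traceback shows the file being written with the cp1252 codec, which has no mapping for that character. Writing the output as UTF-8 with codecs should solve the issue.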

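from docx import Document
import codecs

filepath = r"test.docx"
document = Document(filepath)
# Open the output once, outside the loop, so the BOM that "utf-8-sig"
# emits is written a single time at the start of the file.
with codecs.open("output.txt", "w", "utf-8-sig") as o_file:
    for para in document.paragraphs:
        # Paragraph text has no trailing newline, so add one.
        o_file.write(para.text + "\n")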
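
As an alternative sketch, the built-in open() in Python 3 accepts an encoding argument, so codecs is not strictly required. The helper name write_paragraphs, the sample strings, and the temp-file path below are illustrative assumptions and not part of the original script. With python-docx you would pass (para.text for para in document.paragraphs) in place of the sample list.

```python
import os
import tempfile


def write_paragraphs(paragraphs, out_path):
    """Write each paragraph string to out_path as UTF-8, one per line."""
    # encoding="utf-8" overrides the platform default (cp1252 in the traceback);
    # newline="\n" keeps line endings the same on every platform.
    with open(out_path, "w", encoding="utf-8", newline="\n") as out_file:
        for text in paragraphs:
            out_file.write(text + "\n")


# Stand-in strings (hypothetical); the first one starts with the
# character from the error message, U+2265.
sample = ["\u2265 10 items", "second paragraph"]
out_path = os.path.join(tempfile.gettempdir(), "output_utf8_demo.txt")
write_paragraphs(sample, out_path)
```

Opening the file once and looping inside the with block also avoids reopening (and, in "w" mode, truncating) the file for every paragraph.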