I want to get the plain text of some docx files using python-docx, but I'm struggling with the accents since the text is written in Spanish.
I'm using this answer to read the text:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Each paragraph's text is encoded to UTF-8 before joining
        fullText.append(para.text.encode('utf-8'))
    return '\n'.join(fullText)
It returns things like this:
n\xc3\xbamero //should be número
Is there a way I can get the text with the correct accents?
When I try to write this text to a file using this:
file = open("/mnt/c/Users/lulas/Desktop/inSpanish/txt/course1.txt","w")
file.write(text)
I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 27: ordinal not in range(128)
I assume this is due to how the accented characters are read or encoded.
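To make the two symptoms above concrete, here is a minimal sketch, assuming `para.text` already returns Unicode text (an assumption, not something verified here). The sample string and the output path are hypothetical stand-ins for the real document text and file. It shows that the `\xc3\xba` sequence is just the UTF-8 byte form of `ú`, and that writing through a file opened with an explicit encoding avoids falling back to the `ascii` codec.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample standing in for the text returned by getText().
text = u"n\u00famero de tel\u00e9fono"

# Encoding to UTF-8 yields bytes; their repr shows escapes such as \xc3\xba.
encoded = text.encode("utf-8")
print(repr(encoded))

# Write the Unicode text with an explicit encoding instead of the default codec.
# The path below is a hypothetical temporary file, not the one from the question.
path = os.path.join(tempfile.gettempdir(), "course1_example.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back to check that the accents survive the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == text)
```

If this assumption holds, it might be enough to drop the `.encode('utf-8')` call inside `getText` and let the file object handle the encoding on write.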