UnicodeDecodeError: 'charmap' codec can't decode

Question

I want to read the text of a file into a list, with each line becoming one element. For example, given the text hi \n this is my question \n can u answer it?

My_list[0] should be equal to hi, My_list[1] should be equal to this is my question, and My_list[2] should be equal to can u answer it?

I tried doing so using the following

with open(r'path.docx',encoding="utf8") as f:
    content = f.readlines()
content = [x.strip() for x in content]

following the approach I found in How do I read a file line-by-line into a list?
Then I got a SyntaxError for a Unicode escape, so I referred to Why do I get a SyntaxError for a Unicode escape in my file path? and added r in front of the path string. That solved the first problem, but then I got this error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

To solve it I referred to UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined> and added encoding="utf8". It is still not working.

EDIT: I changed the encoding to "Latin-1", but print(content) did not give the output I want. Instead I got something like ['PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00\t$\x87\ . Again, what I want and expect is a list where each line of the .docx file is an element (separated by \n).

Adding `encoding="utf-8"` only works if your file is actually utf-8 encoded, obviously. — bruno desthuilliers, Nov 23 '17 at 11:42
What is the actual error message when you tried with UTF8? What is the encoding of the file `path.docx`? Is it UTF8 as you assume? You could check with unix `file path.docx` command, or by using the [`chardet`](https://pypi.python.org/pypi/chardet) Python package. — mhawke, Nov 23 '17 at 11:46
I tried "Latin-1" instead of "utf-8" I didn't get an error but print content basically printed nonsense, I'll update my question to showcase. — Jimmy, Nov 23 '17 at 11:48

mhawke · Accepted Answer · 2017-11-23T12:17:25.360

1

Your input file is a docx file, which is a pkzip compressed archive.
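
The 'PK\x03\x04' at the start of the output in the question's EDIT is the signature that ZIP archives begin with. As a minimal sketch (the helper name `starts_with_zip_signature` is made up for illustration, and the demo builds a small archive in memory instead of touching a real file), this is one way to check what a file really is before picking an encoding:

```python
import io
import zipfile

# Every non-empty ZIP archive starts with this local file header signature.
ZIP_LOCAL_HEADER = b"PK\x03\x04"


def starts_with_zip_signature(data: bytes) -> bool:
    """Return True if the given bytes begin with the ZIP signature."""
    return data[:4] == ZIP_LOCAL_HEADER


# Build a tiny archive in memory as a stand-in for a real .docx file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<placeholder/>")
raw = buffer.getvalue()

# Plain UTF-8 text for comparison.
plain_text = "hi\nthis is my question\ncan u answer it?\n".encode("utf-8")

print(starts_with_zip_signature(raw))         # True
print(zipfile.is_zipfile(io.BytesIO(raw)))    # True
print(starts_with_zip_signature(plain_text))  # False

# For a file on disk, read the first bytes in binary mode, e.g.:
# with open("path.docx", "rb") as f:
#     print(starts_with_zip_signature(f.read(4)))
```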

You can't open it as though it is a text file.

Instead you could look at an external package such as python-docx. Something like this might work for you:

import docx

doc = docx.Document('path.docx')
content = [p.text for p in doc.paragraphs]
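
If installing a third-party package is not an option, the same idea can be approximated with the standard library alone. This is only a rough sketch, not how python-docx works internally: it assumes the main text lives in `word/document.xml` inside the archive, and the names `read_docx_paragraphs` and `make_demo_docx` are invented for this example. Unlike `doc.paragraphs`, it also picks up paragraphs nested inside tables.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_paragraphs(source):
    """Return the text of each <w:p> paragraph in a .docx path or file object."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in paragraph.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs


def make_demo_docx(lines):
    """Build a minimal in-memory stand-in for a .docx (demo only)."""
    body = "".join(f"<w:p><w:r><w:t>{line}</w:t></w:r></w:p>" for line in lines)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


demo = make_demo_docx(["hi", "this is my question", "can u answer it?"])
content = read_docx_paragraphs(demo)
print(content)  # ['hi', 'this is my question', 'can u answer it?']
```

For a real document, pass the file path instead of the in-memory buffer, e.g. `read_docx_paragraphs('path.docx')`.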

edited Nov 23 '17 at 12:17

answered Nov 23 '17 at 12:11

I used `pip install docx`, then ran the code with your adjustments and got the error `ModuleNotFoundError: No module named 'exceptions'`, raised from a line in `docx.py`. – Jimmy Nov 23 '17 at 12:35
@Jimmy: you should `pip install python-docx`. Uninstall `docx` first just to make sure that there are no clashes. – mhawke Nov 23 '17 at 12:42
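
Both the `docx` and `python-docx` distributions provide a module that is imported as `docx`, which is why the wrong one is easy to install by accident. A small sketch for checking which of the two is present, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_version` is made up for illustration:

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# The output depends on what is installed in the current environment.
for name in ("docx", "python-docx"):
    print(name, "->", installed_version(name))
```

If `docx` shows a version, uninstalling it before installing `python-docx` avoids the clash described in the comment above.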

score 0 · Answer 2 · answered Nov 23 '17 at 11:37

From the last link you give, I think the problem is that the file you are trying to read is not UTF-8 encoded. Have you tried with another encoding? There is a list here.

answered Nov 23 '17 at 11:37

smolloy

Could you suggest a way to find out exactly what encoding the file uses, so I can check? – Jimmy Nov 23 '17 at 11:40
