
I want to read the text of a file into a list, with each line of the file as a separate element. For example, the file contains: hi \n this is my question \n can u answer it?

My_list should look like this: My_list[0] should be equal to hi, My_list[1] should be equal to this is my question, and My_list[2] should be equal to can u answer it?

I tried doing so using the following:

with open(r'path.docx',encoding="utf8") as f:
    content = f.readlines()
content = [x.strip() for x in content]

This is the approach I found in How do I read a file line-by-line into a list?
I then got a SyntaxError for a Unicode escape, so I referred to Why do I get a SyntaxError for a Unicode escape in my file path? and added the r prefix to the path. That solved the first problem, but then I got this error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

To solve that, I referred to UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined> and added encoding="utf8". It is still not working.

EDIT: I changed the encoding to "Latin-1", but I didn't get the output I want from print(content). Instead I got output like ['PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00!\x00\t$\x87\ . Again, what I want and expect is a list where each line of the .docx file is an element (separated by \n).

Jimmy
  • Adding `encoding="utf-8"` only works if your file is actually utf-8 encoded, obviously. – bruno desthuilliers Nov 23 '17 at 11:42
  • What is the actual error message when you tried with UTF8? What is the encoding of the file `path.docx`? Is it UTF8 as you assume? You could check with unix `file path.docx` command, or by using the [`chardet`](https://pypi.python.org/pypi/chardet) Python package. – mhawke Nov 23 '17 at 11:46
  • I tried "Latin-1" instead of "utf-8". I didn't get an error, but print(content) basically printed nonsense. I'll update my question to show it. – Jimmy Nov 23 '17 at 11:48
  • That is not a text file but a binary file. – Stop harming Monica Nov 23 '17 at 12:13

2 Answers


Your input file is a docx file, which is a pkzip compressed archive.
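
One way to confirm this is to check the file with the standard library's zipfile module. The following is only a minimal sketch: the helper name looks_like_docx is made up for illustration, and it assumes that a Word document keeps its main body in an archive member called word/document.xml, which is the usual layout.

```python
import zipfile


def looks_like_docx(path):
    """Return True if `path` is a zip archive that contains word/document.xml.

    Hypothetical helper for illustration; it only inspects the archive
    listing and does not validate the document itself.
    """
    # A plain text file fails this check, a .docx file passes it.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "word/document.xml" in archive.namelist()
```

Calling looks_like_docx(r'path.docx') should return True for a genuine Word document, which also explains the 'PK\x03\x04' bytes at the start of the Latin-1 output: they are the zip file signature.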

You can't open it as though it is a text file.

Instead you could look at an external package such as python-docx. Something like this might work for you:

import docx

doc = docx.Document('path.docx')
content = [p.text for p in doc.paragraphs]
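
If installing an external package is not an option, a rough standard-library alternative is to unzip the document and pull the paragraph text out of the XML yourself. This is a sketch under stated assumptions, not a replacement for python-docx: the function name docx_paragraphs is hypothetical, it assumes the body lives in word/document.xml, and it treats every w:p element as one list entry, so paragraphs nested inside tables or text boxes may not come out the way you expect.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in a .docx file as a list of strings.

    Hypothetical helper for illustration; assumes the main document part
    is stored as word/document.xml inside the archive.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph can be split into several runs; join their text pieces.
        pieces = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs
```

With this, content = docx_paragraphs('path.docx') gives a list in the same shape as the python-docx version above.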
mhawke
  • I used `pip install docx`, then ran the code with your adjustments and got the error `ModuleNotFoundError: No module named 'exceptions'`, which is raised from `docx.py`. – Jimmy Nov 23 '17 at 12:35
  • @Jimmy: you should `pip install python-docx`. Uninstall `docx` first just to make sure that there are no clashes. – mhawke Nov 23 '17 at 12:42

Judging from the last link you gave, I think the problem is that the file you are trying to read is not UTF-8 encoded. Have you tried another encoding? There is a list here.

smolloy