I'm new to Python and have run into a strange problem.
I have 50 .txt files in a directory, and I want to read each one and store its content in its own variable, like this:
    file = open(fcf[i], 'r')
    text[i] = file.read()
Reading a single file works fine:
    count = 0
    for file_flag in fcf:
        if file_flag == 'feature.txt':
            file = open(fcf[count], 'r')
            features = file.read().split()  # a list, word by word
        count = count + 1
However, reading all the .txt files in a loop fails.
Here is my code, followed by the strange error it raises:
    text = np.zeros((np.shape(fcf)[0], 1))
    for flag in range(np.shape(fcf)[0]):
        file = open(fcf[flag], 'r')
        text = file.read()  # string
        file.close()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-41-7e544d88ee9d> in <module>()
2 for flag in range(np.shape(fcf)[0]):
3 file = open(fcf[flag], 'r')
----> 4 text = file.read() # string
5 file.close()
**UnicodeDecodeError: 'gbk' codec can't decode byte 0x94 in position 418: illegal multibyte sequence**
Update:
If I pass the encoding explicitly inside the loop:
    file = open(fcf[flag], 'r', encoding='UTF-8')
an error still occurs:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 418: invalid start byte
Could anyone help me? Thank you very much!
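One way to narrow down an error like this is to check which files are not valid UTF-8 before trying to read them all. The following is only a minimal sketch: `find_non_utf8_files` is a hypothetical helper name, and it assumes `fcf` is a list of file paths as in the code above.

```python
from pathlib import Path


def find_non_utf8_files(paths):
    """Return (path, error) pairs for files whose bytes are not valid UTF-8.

    Hypothetical helper for illustration; not part of any library.
    """
    failures = []
    for path in paths:
        raw = Path(path).read_bytes()  # raw bytes, nothing decoded yet
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            failures.append((str(path), exc))
    return failures


# Possible usage, assuming fcf is the list of paths from the question:
# for name, exc in find_non_utf8_files(fcf):
#     print(name, "byte", hex(exc.object[exc.start]), "at position", exc.start)
```

Reading the bytes first and decoding them separately keeps the check independent of the platform's default encoding, which is where the 'gbk' in the first error message came from.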
Update2:
It seems that most of these .txt files are in Unicode (UTF-8), which Python reads without trouble. Checking in Notepad, I found that two of the .txt files are ANSI-encoded, and those are what cause the problem.
How can I read both ANSI and Unicode files in the same loop in Python?
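A common approach to mixed encodings is to try a strict encoding first and fall back to a more permissive one. This is a sketch under assumptions, not a definitive fix: `read_text_with_fallback` and `read_all` are hypothetical names, and "ANSI" is assumed here to mean Windows-1252 (`cp1252`), which may not be true for every file.

```python
def read_text_with_fallback(path, encodings=("utf-8-sig", "cp1252")):
    """Decode a file with the first encoding in `encodings` that succeeds.

    Assumption: the fallback encoding is cp1252; adjust if the files differ.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"{path!r} could not be decoded with any of {encodings}")


def read_all(paths):
    """Map each path to its decoded text, one entry per file."""
    return {path: read_text_with_fallback(path) for path in paths}
```

The order matters: UTF-8 rejects most byte sequences that are not really UTF-8, while `cp1252` accepts almost anything, so it only makes sense as the last candidate. `utf-8-sig` is used so that a leading byte order mark, if present, is dropped.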
Update3:
Thanks everyone. This problem is fixed.
There were two causes:
A few ANSI-encoded .txt files were mixed in with files that are otherwise UTF-8.
In the ANSI files, some character sequences come out garbled:
didn’t -> didn抰, weren’t -> weren抰, etc. (’t -> 抰, and “Well -> 揥ell)
This happens with the ANSI .txt files even though my PC is set entirely to English. (I had to fix these by hand, since Notepad only changes the encoding and does not repair the garbled characters above.)
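The garbling above looks like what happens when Windows-1252 bytes are decoded as GBK, which would match the 'gbk' codec named in the first error. If that assumption holds, the damage can sometimes be reversed in code instead of by hand. This is only a sketch: `repair_mojibake` is a hypothetical helper, and it should be applied only to text already known to be garbled, because running it on correct text would corrupt it.

```python
def repair_mojibake(text, wrong="gbk", right="cp1252"):
    """Undo text decoded as `wrong` when its bytes were really `right`.

    Assumption: the garbling came from exactly this pair of encodings.
    """
    try:
        # Recover the original bytes, then decode them with the intended codec.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not reversible this way; leave the text unchanged
```

For example, calling `repair_mojibake` on a garbled string such as the `didn抰` case above re-encodes it to the original bytes and decodes those as Windows-1252. Reading the file with the right encoding in the first place avoids the need for this step.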
I hope this helps others facing a similar problem. Thanks!