How to read both ANSI and Unicode txt files in Python?

Question

I'm new to Python and have run into a strange problem.

I have 50 .txt files in a directory, and I want to read each one and store its content in its own variable, like this:

file = open(fcf[i], 'r')
text[i] = file.read()

Reading a single file works fine:

count = 0
for file_flag in fcf:
    if file_flag == 'feature.txt':
        file = open(fcf[count], 'r')
        features = file.read().split() # a list, word by word
    count = count+1

However, reading the files in a loop fails. Below is my code and the error it raises:

text = np.zeros((np.shape(fcf)[0],1))
for flag in range(np.shape(fcf)[0]):
    file = open(fcf[flag], 'r')
    text = file.read() # string
    file.close()

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-41-7e544d88ee9d> in <module>()
      2 for flag in range(np.shape(fcf)[0]):
      3     file = open(fcf[flag], 'r')
----> 4     text = file.read() # string
      5     file.close()

**UnicodeDecodeError: 'gbk' codec can't decode byte 0x94 in position 418: illegal multibyte sequence**

Update:

With the encoding given explicitly inside the loop:

file = open(fcf[flag], 'r', encoding='UTF-8')

an error still occurs:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 418: invalid start byte

Could anyone help me? Thank you very much!

Update2:

It seems that most of these .txt files are in Unicode, which Python reads without trouble. Checking in Notepad, I found that two of the .txt files are in ANSI encoding, which causes the problem.

How can I read both ANSI and Unicode files in Python?

Update3:

Thanks everyone. This problem is fixed.

There were two causes:

A few ANSI-encoded .txt files were mixed in among the UTF-8 files.
Some characters in the ANSI files were decoded as the wrong characters:

didn’t -> didn抰, weren’t -> weren抰, etc. (’t -> 抰), and "Well -> 揥ell

Although my PC is set entirely to English, this still happens with the ANSI .txt files. (I had to fix them by hand, since Notepad only changes the encoding and does not repair the garbled characters above.)
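The substitutions listed above can be reproduced in a few lines. This is only a sketch under an assumption the thread does not confirm: that the "ANSI" files are really Windows-1252 (`cp1252`), where the curly apostrophe is the single byte 0x92. A GBK decoder then reads that byte together with the following `t` as one two-byte character.

```python
# Assumption: the "ANSI" files are Windows-1252 (cp1252); the thread does not confirm this.
original = "didn’t"

# Bytes as they would be stored in a cp1252 file: the apostrophe becomes 0x92.
raw = original.encode("cp1252")

# Decoding those bytes as GBK pairs 0x92 with the next byte ("t"),
# which produces the kind of garbling reported in the question.
mojibake = raw.decode("gbk")

# Reversing the wrong decoding recovers the original text.
repaired = mojibake.encode("gbk").decode("cp1252")

print(raw)
print(mojibake)
print(repaired)
```

If the assumption holds, reading the affected files with `encoding='cp1252'` in the first place would avoid the garbling, so no manual repair would be needed.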

I hope this helps other people facing a similar problem. Thanks!

Possible duplicate of [UnicodeDecodeError when redirecting to file](https://stackoverflow.com/questions/4545661/unicodedecodeerror-when-redirecting-to-file)? Also many more related questions just when [searching the site](https://stackoverflow.com/search?q=UnicodeDecodeError). — Kevin, Mar 04 '18 at 08:15
Hi Kevin, the link you posted looks confusing... could you help me more? — Xiao Guo, Mar 04 '18 at 08:22
Try `open(fcf[flag], 'r', encoding='UTF-8')`. [Source](https://stackoverflow.com/a/20994462/8811872) — Kevin, Mar 04 '18 at 08:28
Thanks Kevin! a similar error comes up again. I update more details on the question. When I read a single txt file, it's ok. It gives me an error when I use a loop to read multiple files. — Xiao Guo, Mar 04 '18 at 08:38
You need some way to determine the encoding. Do the UTF8 files all have a BOM? — David Heffernan, Mar 04 '18 at 08:45
Thanks David, I think I found the problem but still need help. Among the many Unicode txt files there are several ANSI txt files... how could I detect automatically whether a file is Unicode or ANSI and decode it accordingly? Thx!! — Xiao Guo, Mar 04 '18 at 08:48
Ohhh, I get your question. I checked them in Notepad: they are UTF-8, with no BOM — Xiao Guo, Mar 04 '18 at 09:02
Thank you. In my problem today, I have to manually fix them. Anyway, thx. — Xiao Guo, Mar 04 '18 at 09:29
Python allows catching errors. Use UTF8 throughout except on files that throw an error. Catch it and use that other encoding. — Jongware, Mar 04 '18 at 09:37
hi, thanks for your information, maybe you could share more details? thx! — Xiao Guo, Mar 04 '18 at 09:49
In case that comment was targeted to me: sure. See the official Python tutorial on [Catching Exceptions](https://docs.python.org/3/tutorial/errors.html). — Jongware, Mar 04 '18 at 10:54
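The approach suggested in the comments (try UTF-8 first, catch the error, then use another encoding) might look like the sketch below. The function name `read_text_with_fallback` is hypothetical, and `cp1252` as the fallback is an assumption about what Notepad's "ANSI" means for these particular files.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each encoding in order.

    The default fallback, cp1252, is an assumption; replace it with
    whatever encoding the non-UTF-8 files actually use.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for encoding in encodings:
        try:
            return Path(path).read_text(encoding=encoding), encoding
        except UnicodeDecodeError as error:
            # This encoding does not fit the file; remember why and try the next.
            last_error = error
    # No encoding worked: re-raise the last decoding error.
    raise last_error


# Possible usage with the list of file names from the question:
# texts = {name: read_text_with_fallback(name)[0] for name in fcf}
```

Keeping the texts in a dict keyed by file name also avoids overwriting a single `text` variable on each pass through the loop.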

Thierry Lathuille · Answer 1 · 2018-03-04T08:40:25.580

1

You are opening the file in text mode without specifying an encoding, so Python decodes it with your platform's default encoding, which appears to be 'gbk'. The file you're trying to read evidently uses a different encoding, which causes this error.

You have to pass the encoding to use to `open`. If it is 'UTF-8', for example:
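To see which default applies on a given machine, the standard `locale` module can be queried. This is a small sketch; the value printed depends on the system and on whether Python's UTF-8 mode is enabled.

```python
import locale

# The encoding open() normally falls back to in text mode
# when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

On the asker's machine this would presumably report a GBK code page, which matches the codec named in the traceback.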

file = open(fcf[flag], 'r', encoding='UTF-8')

If your file uses a different encoding, you must figure out which one first; I don't know what is common in your part of the world. You can have a look at the list of standard encodings in the Python documentation.

For Chinese, the listed encodings include 'gb2312', 'gb18030' and 'hz'; you could try those.
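One way to try several encodings at once is to read the raw bytes and check which candidates decode without an error. The helper name `encodings_that_decode` and the sample bytes below are illustrative assumptions, not part of the original answer.

```python
def encodings_that_decode(raw, candidates):
    """Return the candidate encodings that decode `raw` without an error."""
    working = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        working.append(name)
    return working


# Illustrative sample: "didn’t" saved as Windows-1252, where ’ is the byte 0x92.
# For a real file, the bytes could come from open(path, 'rb').read().
sample = "didn’t".encode("cp1252")
candidates = ["utf-8", "gb2312", "gb18030", "hz", "gbk", "cp1252"]
print(encodings_that_decode(sample, candidates))
```

Several candidates may succeed for the same bytes, so decoding without an error does not prove the encoding is the right one; the decoded text still has to be checked by eye.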

edited Mar 04 '18 at 08:40

answered Mar 04 '18 at 08:31


Thank you Thierry! The same problem appears T_T Could you figure out why? UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 418: invalid start byte – Xiao Guo Mar 04 '18 at 08:33
Your file is not encoded in UTF8 either, check the updated answer. – Thierry Lathuille Mar 04 '18 at 08:41
Thanks Thierry! My PC is set entirely to English and all the txt files are in English. I found that 2 files in the directory are in ANSI encoding, which gives the error. How can I read ANSI? Passing it as the encoding gives the error "unknown encoding: ANSI" T_T – Xiao Guo Mar 04 '18 at 08:44
