UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-5: ordinal not in range(256)

Question

Hi everyone! I use Python 3 (PyCharm), and my code looks like this:

# -*- coding: utf-8 -*-
import numpy

c=numpy.loadtxt('test.csv',dtype="str_",delimiter=',',usecols=(6,),unpack=True)

When test.csv contains some Chinese words, I get an error like this:

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-5: ordinal not in range(256)

I have tried to encode the file name, like this:

c=numpy.loadtxt('test.csv'.encode('utf-8'),dtype="str_",skiprows=0,delimiter=',',usecols=(6,),unpack=True)

Then I got another error:

IndexError: list index out of range

Besides, the Chinese words in the file are longer than 64.

I have wasted a lot of time on this. Please give me a hand!

score 0 · Answer 1 · answered Jul 27 '16 at 04:23

import numpy

with open('test.csv', encoding='utf-8') as fh:
    c = numpy.loadtxt(fh, dtype="str_", delimiter=',', usecols=(6,), unpack=True)

Sergey Gornostaev

Now there's a new error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte` – Robin Jul 27 '16 at 04:31

score 0 · Accepted Answer · edited May 23 '17 at 12:32

When we read Chinese characters into numpy, the data type cannot be a plain string, because numpy then treats each character as a single-byte (ASCII-like) character, which is not wide enough to hold a multi-byte UTF-8 character.

So what I have done here is to tell numpy that we are reading 4-byte values instead (dtype "S4", a fixed-width byte string), which is wide enough to hold one UTF-8 encoded character. The raw bytes are then decoded explicitly.
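To make the byte-width reasoning concrete, here is a minimal sketch in plain Python (no numpy involved) that counts how many bytes each field needs in a given encoding. The helper name `byte_width` and the longer third field are made up for illustration; they are not part of the original answer.

```python
# Sample fields; the last one is a hypothetical longer string for illustration.
fields = ["7", "七", "一二三四五六七"]


def byte_width(values, encoding):
    """Return the largest encoded size, in bytes, among the given strings."""
    return max(len(value.encode(encoding)) for value in values)


for encoding in ("utf-8", "gb2312"):
    # Size of every field once it is encoded to bytes.
    sizes = {value: len(value.encode(encoding)) for value in fields}
    # A fixed-width "S<n>" dtype would need at least this many bytes.
    print(encoding, sizes, "-> dtype", "S%d" % byte_width(fields, encoding))
```

A single Chinese character such as 七 takes 3 bytes in UTF-8 and 2 bytes in GB2312, so "S4" fits one character but silently cuts off anything longer. For longer fields, the width has to grow with the encoded size, not with the character count.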

I have used the following sample data for testing:

1,2,3,4,5,6,7
一,二,三,四,五,六,七

Here is the code I have used:

# -*- coding: utf-8 -*-
import numpy
c=numpy.genfromtxt('test.csv',dtype="S4",delimiter=',',usecols=(6,),unpack=True)

for txt in c:
    print(txt.decode("utf-8"))

You can check the links below to learn more:
1. How many bytes does one Unicode character take?
2. Numpy Data type objects
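As an alternative to reading raw bytes and decoding them by hand, the sketch below lets numpy do the decoding. It assumes a NumPy release whose `loadtxt` accepts the `encoding` parameter (recent versions document it), and it assumes the file is GB2312-encoded, as the asker's file turned out to be. The temporary file only exists so that the example is self-contained.

```python
import os
import tempfile

import numpy

# Same sample data as above, written out as a GB2312-encoded CSV file.
sample = "1,2,3,4,5,6,7\n一,二,三,四,五,六,七\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    with open(path, "w", encoding="gb2312") as fh:
        fh.write(sample)

    # Tell numpy the file's real encoding and ask for unicode strings,
    # so no fixed byte width and no manual .decode() call are needed.
    c = numpy.loadtxt(path, dtype=str, delimiter=",", usecols=(6,),
                      encoding="gb2312")

for txt in c:
    print(txt)
```

With `dtype=str` the values come back as ordinary Python-compatible unicode strings, so the width problem of "S4" does not arise. If the file were UTF-8 instead, only the `encoding` argument would change.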

answered Jul 27 '16 at 04:29

Simon MC. Cheng


In my CSV, it can read 7 but can't read 七, and it raises an error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte` – Robin Jul 27 '16 at 04:38
Hi Robin, would you share your testing csv file content? – Simon MC. Cheng Jul 27 '16 at 05:48
By the way, have you saved your testing CSV in utf-8 format? – Simon MC. Cheng Jul 27 '16 at 05:49
@Simon MC. Cheng, hello. My CSV's encoding is GB2312, so when I changed `print(txt.decode('utf-8'))` to `print(txt.decode('GB2312'))`, it works well! Thank you very much. And if I want to read a much longer string, I can change the dtype, am I right? – Robin Jul 27 '16 at 05:57
@Robin, yes, you can change the value of dtype to suit your needs. – Simon MC. Cheng Jul 27 '16 at 06:49
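The exchange above comes down to decoding the bytes with the encoding the file was actually saved in. When that encoding is not known in advance, one rough approach is to try a short list of candidates in order. The sketch below is only a heuristic under that assumption; `decode_field` and its candidate list are hypothetical names chosen for illustration.

```python
def decode_field(raw, encodings=("utf-8", "gb2312")):
    """Decode raw bytes with the first candidate encoding that accepts them.

    Returns a (text, encoding) pair so the caller can see which one matched.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode %r" % (raw,))


utf8_bytes = "七".encode("utf-8")
gb_bytes = "七".encode("gb2312")

print(decode_field(utf8_bytes))
print(decode_field(gb_bytes))
```

Trying UTF-8 first is deliberate, since it rejects most byte sequences that were written in another encoding. It is still a guess: some GB2312 byte pairs also happen to be valid UTF-8, so knowing the file's real encoding is more reliable than detecting it.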
