
Hi everyone! I use Python 3 (PyCharm), and my code looks like this:

# -*- coding: utf-8 -*-
import numpy

c=numpy.loadtxt('test.csv',dtype="str_",delimiter=',',usecols=(6,),unpack=True)

When test.csv contains Chinese text, I get an error like this:

UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-5: ordinal not in range(256)

I have tried to encode the file, like this:

c=numpy.loadtxt('test.csv'.encode('utf-8'),dtype="str_",skiprows=0,delimiter=',',usecols=(6,),unpack=True)

Then I got another error:

IndexError: list index out of range

Besides, the Chinese text in the file is longer than 64.

I have wasted a lot of time on this. Please give me a hand!

Robin

2 Answers

import numpy

with open('test.csv', encoding='utf-8') as fh:
    c = numpy.loadtxt(fh, dtype="str_", delimiter=',', usecols=(6,), unpack=True)
Sergey Gornostaev
  • Now there's a new error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte` – Robin Jul 27 '16 at 04:31
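An "invalid continuation byte" error like this usually means the bytes are not UTF-8 at all. As a minimal sketch (the sample text `中文` is an assumption for illustration, not the asker's data), the snippet below encodes a string as GB2312, shows that decoding it as UTF-8 fails, and that decoding with the matching codec succeeds:

```python
# Hypothetical sample text; the asker's real file content is not shown in the thread.
text = "中文"
raw = text.encode("gb2312")  # the bytes a GB2312-saved file would contain

# Decoding GB2312 bytes as UTF-8 raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("utf-8 decode failed:", exc.reason)

# Decoding with the codec the file was actually saved in works.
decoded = raw.decode("gb2312")
print(decoded == text)
```

If the same pattern holds for a real file, the fix is to pass the file's actual encoding to `open()` rather than `'utf-8'`.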

When numpy reads Chinese characters, a one-byte-per-character string type is not enough, because a Chinese character encoded as UTF-8 occupies more than one byte (typically three).

So what I have done here is tell numpy to read each field as a 4-byte string (`S4`), which is long enough to hold one such character.
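To make the byte counts concrete, here is a small sketch using only the standard library. The characters are taken from the sample data in this answer; the variable names are illustrative.

```python
# One Chinese character takes a different number of bytes per encoding.
ch = "七"
utf8_bytes = ch.encode("utf-8")
gb_bytes = ch.encode("gb2312")
print(len(utf8_bytes))  # 3
print(len(gb_bytes))    # 2

# A 4-byte field fits one 3-byte UTF-8 character,
# but two such characters need 6 bytes and would not fit.
two = "七八".encode("utf-8")
print(len(two))  # 6
```

A field longer than one character therefore needs a wider dtype than `S4`.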

I have used the following sample data for testing:

1,2,3,4,5,6,7
一,二,三,四,五,六,七

Here is the code I have used:

# -*- coding: utf-8 -*-
import numpy

# "S4" reads each field as a raw 4-byte string; decode it when printing.
c = numpy.genfromtxt('test.csv', dtype="S4", delimiter=',', usecols=(6,), unpack=True)

for txt in c:
    print(txt.decode("utf-8"))

You can check the links below to learn more:
1. How many bytes does one Unicode character take?
2. Numpy Data type objects

Simon MC. Cheng
  • In my CSV, it can read 7 but can't read 七, and it raises an error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte` – Robin Jul 27 '16 at 04:38
  • Hi Robin, would you share your testing csv file content? – Simon MC. Cheng Jul 27 '16 at 05:48
  • By the way, have you saved your testing CSV in utf-8 format? – Simon MC. Cheng Jul 27 '16 at 05:49
  • @Simon MC. Cheng Hello, my CSV is encoded as GB2312. So when I changed `print(txt.decode('utf-8'))` to `print(txt.decode('GB2312'))`, it worked well! Thank you very much. And if I want to read a much longer string, I can change the dtype, am I right? – Robin Jul 27 '16 at 05:57
  • @Robin, yes, you can change the value of dtype to suit your needs. – Simon MC. Cheng Jul 27 '16 at 06:49
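
Putting the thread's conclusion together, one possible approach is to open the file with its real encoding and let numpy read text instead of fixed-width bytes. This is a sketch under stated assumptions, not the answerer's code: it writes a temporary GB2312 file from the sample data above (the file name `test_gb2312.csv` is hypothetical) and assumes a numpy version in which `loadtxt` accepts a text-mode file handle with `dtype=str`.

```python
import os
import tempfile

import numpy

# Sample rows from the answer above, saved as GB2312 to mimic the asker's file.
rows = ["1,2,3,4,5,6,7", "一,二,三,四,五,六,七"]
path = os.path.join(tempfile.mkdtemp(), "test_gb2312.csv")  # hypothetical file name
with open(path, "w", encoding="gb2312") as fh:
    fh.write("\n".join(rows) + "\n")

# Open with the file's actual encoding so numpy receives already-decoded text.
# dtype=str avoids choosing a fixed byte width such as "S4".
with open(path, encoding="gb2312") as fh:
    c = numpy.loadtxt(fh, dtype=str, delimiter=",", usecols=(6,))

for txt in c:
    print(txt)
```

With this approach no per-field `decode()` call is needed, and longer fields are not truncated by a byte-width limit.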