Loading UTF-8 file in Python 3 using numpy.genfromtxt

Question

I have a CSV file that I downloaded from the WHO site (http://apps.who.int/gho/data/view.main.52160 , Downloads, "multipurpose table in CSV format"). I am trying to load the file into a numpy array. Here's my code:

import numpy
#U75 - unicode string of max. length 75
world_alcohol = numpy.genfromtxt("xmart.csv", dtype="U75", skip_header=2, delimiter=",")
print(world_alcohol)

And I get

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128).

I guess that numpy has a problem reading the string "Côte d'Ivoire". The file is properly encoded UTF-8 (according to my text editor). I am using Python 3.4.3 and numpy 1.9.2.

What am I doing wrong? How can I read the file into numpy?

hpaulj · Accepted Answer · 2021-07-04T14:39:51.393

Note the original 2015 date. Since then genfromtxt has gotten an encoding parameter.
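
A minimal sketch of what that can look like, assuming NumPy 1.14 or later (where, as far as I know, the encoding parameter first appeared). The sample rows and the file name sample.csv are made up for illustration and stand in for the WHO download.

```python
import os
import tempfile

import numpy as np

# Made-up sample rows standing in for the real CSV file.
sample = "Country,Year,Value\nCôte d'Ivoire,2010,6.5\nFrance,2010,12.2\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # encoding="utf-8" asks genfromtxt to decode the file itself,
    # so a unicode dtype such as U75 can be filled directly.
    world_alcohol = np.genfromtxt(
        path, dtype="U75", delimiter=",", skip_header=1, encoding="utf-8"
    )

print(world_alcohol[0, 0])
```

On older NumPy releases without this parameter, the decode-based workarounds in the rest of this answer still apply.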

In Python3 I can do:

In [224]: txt = "Côte d'Ivoire"
In [225]: x = np.zeros((2,),dtype='U20')
In [226]: x[0] = txt
In [227]: x
Out[227]: 
array(["Côte d'Ivoire", ''],   dtype='<U20')

Which means I probably could open a 'UTF-8' file (regular, not byte mode), and readlines, and assign them to elements of an array like x.
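
A rough sketch of that idea, bypassing genfromtxt entirely. It assumes every row has the same number of fields; the sample data and file name are invented, and the csv module is used here for the splitting (a choice made for this example, not something the original test did).

```python
import csv
import os
import tempfile

import numpy as np

# Invented sample data; the real file would come from the WHO download.
sample = "Country,Year\nCôte d'Ivoire,2010\nFrance,2010\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Text mode with an explicit encoding yields str objects, not bytes.
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader)          # skip the header row
        rows = list(reader)   # list of lists of str

# All rows have equal length here, so this becomes a 2-D unicode array.
x = np.array(rows, dtype="U75")
print(x)
```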

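But genfromtxt (at least in the 2015 version) insists on operating with byte strings, which it treats as ASCII. ASCII can't represent the larger UTF-8 character set (7 bits per character vs. full 8-bit bytes, with some characters taking several bytes). So I need to apply decode at some point to get a U array.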
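
The difference can be seen with plain Python, no numpy involved. This is only an illustration of why an ASCII decode fails on this string, not a reproduction of what genfromtxt does internally.

```python
txt = "Côte d'Ivoire"
raw = txt.encode("utf-8")

print(len(txt), len(raw))  # 13 characters but 14 bytes: "ô" takes two bytes
print(raw[1:3])            # the two bytes that encode "ô"

error = None
try:
    raw.decode("ascii")    # ASCII only covers byte values 0-127
except UnicodeDecodeError as exc:
    error = exc            # keep a reference; exc is cleared after the block
    print(exc)

decoded = raw.decode("utf-8")  # the right codec recovers the original text
print(decoded)
```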

I can load it into a 'S' array with genfromtxt:

In [258]: txt="Côte d'Ivoire"
In [259]: a=np.genfromtxt([txt.encode()],delimiter=',',dtype='S20')
In [260]: a
Out[260]: 
array(b"C\xc3\xb4te d'Ivoire",  dtype='|S20')

and apply decode to individual elements:

In [261]: print(a.item().decode())
Côte d'Ivoire


Or use np.char.decode to apply it to each element of an array:

In [263]: np.char.decode(a)
Out[263]: 
array("Côte d'Ivoire", dtype='<U13')
In [264]: print(_)
Côte d'Ivoire

genfromtxt lets me specify converters:

In [297]: np.genfromtxt([txt.encode()],delimiter=',',dtype='U20',
    converters={0:lambda x: x.decode()})
Out[297]: 
array("Côte d'Ivoire", dtype='<U20')

If the csv has a mix of strings and numbers, this converters approach will be easier to use than the np.char.decode. Just specify the converter for each string column.
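
On NumPy versions that have the encoding parameter, a mixed file can apparently be read into a structured array without per-column decode converters. A minimal sketch, assuming NumPy 1.14 or later; the column names, field widths, sample rows and file name are all hypothetical.

```python
import os
import tempfile

import numpy as np

# Invented sample with one text column and two numeric columns.
sample = "Country,Year,Value\nCôte d'Ivoire,2010,6.5\nFrance,2010,12.2\n"

# Hypothetical column layout for the structured array.
columns = [("country", "U20"), ("year", "i4"), ("value", "f8")]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # One dtype entry per column; encoding handles the UTF-8 text column.
    data = np.genfromtxt(
        path, dtype=columns, delimiter=",", skip_header=1, encoding="utf-8"
    )

print(data["country"])
print(data["value"].mean())
```

Each field is then available by name, so the numeric columns can be used in arithmetic while the text column stays a unicode array.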

(See my earlier edits for Python2 tries).

Not OP but thanks for the clear and useful buildup of the answer. - KobeJohn, Oct 07 '15 at 22:39
Thank you for the answer. It works! I am just beginning with Python and I find it weird that numpy can't read UTF-8 out-of-the-box. I have read that Python is easy and developed with simplicity and ease of use in mind, yet reading UTF-8 requires additional conversion? I thought we were living in 2015. - JustAC0der, Oct 08 '15 at 08:02
