I'm trying to read from a file containing characters like é, Ä, etc. I'm using numpy.loadtxt() but I'm getting UnicodeDecodeErrors as the decoder cannot parse them. My first priority is to preserve those characters if at all possible but if not, I would not mind resorting to replacing them. Any suggestions?
-
`loadtxt` (and `genfromtxt`) work with bytestrings (for Py2 compatibility among other things). It may be simpler to write your own reader, or use the `csv` module, than to fight it. Make sure your `structured array` `dtype` uses unicode string type. – hpaulj Feb 11 '16 at 19:30
-
Possibly related: http://stackoverflow.com/q/33001373/190597 – unutbu Feb 11 '16 at 19:32
-
The csv module would definitely come in handy but the columns are tab-separated. When you say "write my own reader" you mean base it on standard python file I/O? – AutomEng Feb 11 '16 at 19:36
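(For reference, the csv module handles tab-separated columns as well; the delimiter just has to be set explicitly. A minimal sketch, with the filename and encoding as assumptions rather than anything from the question:

import csv
import numpy as np

# csv.reader with an explicit tab delimiter; open with a known encoding
with open('uni_csv.txt', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t'))

arr = np.array(rows)  # unicode string dtype, e.g. '<U2', is inferred
)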
1 Answer
In addition to the link that @unutbu found (using decode/encode in genfromtxt), here's a quick sketch of a direct file reader:
Sample file (UTF-8):
é, Ä
é, Ä
é, Ä
Read the lines, split, and pass through np.array:
In [327]: fn='uni_csv.txt'
In [328]: with open(fn) as f:lines=f.readlines()
In [329]: lines
Out[329]: ['é, Ä\n', 'é, Ä\n', 'é, Ä\n']
...
In [331]: [l.strip().split(',') for l in lines]
Out[331]: [['é', ' Ä'], ['é', ' Ä'], ['é', ' Ä']]
In [332]: np.array([l.strip().split(',') for l in lines])
Out[332]:
array([['é', ' Ä'],
['é', ' Ä'],
['é', ' Ä']],
dtype='<U2')
I don't think tab separation poses a problem (except that my text editor is set to replace tabs with spaces).
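If the file really is tab-separated, only the split argument changes; a quick sketch along the same lines:

# same readlines approach, just splitting on tabs instead of commas
with open(fn) as f:
    lines = f.readlines()
arr = np.array([l.strip().split('\t') for l in lines])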
For mixed datatypes, I need to add a tuple conversion (a structured array definition requires a list of tuples):
In [343]: with open(fn) as f:lines=f.readlines()
In [344]: dt=np.dtype([('int',int),('é','|U2'),('Ä','U5')])
In [345]: np.array([tuple(l.strip().split(',')) for l in lines], dt)
Out[345]:
array([(1, ' é', ' Ä'), (2, ' é', ' Ä'), (3, ' é', ' Ä')],
dtype=[('int', '<i4'), ('é', '<U2'), ('Ä', '<U5')])
(I added an integer column to my text file)
Actually loadtxt doesn't choke on this file and dtype either; it just loads the strings wrong.
In [349]: np.loadtxt('uni_csv.txt',dtype=dt, delimiter=',')
Out[349]:
array([(1, "b'", "b' \\x"), (2, "b'", "b' \\x"), (3, "b'", "b' \\x")],
dtype=[('int', '<i4'), ('é', '<U2'), ('Ä', '<U5')])
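For completeness, the decode route from the linked question looks roughly like this; a sketch assuming an older numpy where genfromtxt hands byte strings to the converters, with column indices for the hypothetical 3-column file above:

# decode each string column's bytes to unicode before they land in the array
decode = lambda b: b.decode('utf-8').strip()
dt = np.dtype([('int', int), ('é', 'U2'), ('Ä', 'U5')])
arr = np.genfromtxt('uni_csv.txt', delimiter=',', dtype=dt,
                    converters={1: decode, 2: decode})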

hpaulj
-
The interpreter spits out the following problem `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6908: ordinal not in range(128)` which means it still can't read certain characters in the file. My problem is that the source file is encoded in UTF-8. It shouldn't, in theory, cause such problems, right? – AutomEng Feb 11 '16 at 19:59
-
Neither am I (it probably finds another non-ASCII character somewhere in the text) but apparently opening the file with `open("outfile.sql", encoding="utf-8")` and appending each line to a list for further processing works. It's a clunky workaround in my opinion but it will have to do. – AutomEng Feb 11 '16 at 20:09
-
I'm upvoting your answer for the handling of the different data types though, thanks for that. – AutomEng Feb 11 '16 at 20:12
-
Your answer worked by adding the optional `encoding=`. In other words the initial call to the file handler would be: `with open(fn, encoding="utf-8") as f:lines=f.readlines()` – AutomEng Feb 12 '16 at 03:15
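Folding that fix back into the answer's reader, the working version looks roughly like this (same hypothetical file layout as above):

import numpy as np

# the explicit encoding is what avoids the ascii-codec UnicodeDecodeError
with open('uni_csv.txt', encoding='utf-8') as f:
    lines = f.readlines()

dt = np.dtype([('int', int), ('é', 'U2'), ('Ä', 'U5')])
arr = np.array([tuple(l.strip().split(',')) for l in lines], dt)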