I'm trying to read from a file containing characters like é, Ä, etc. I'm using numpy.loadtxt() but I'm getting UnicodeDecodeErrors as the decoder cannot parse them. My first priority is to preserve those characters if at all possible but if not, I would not mind resorting to replacing them. Any suggestions?
-
`loadtxt` (and `genfromtxt`) work with bytestrings (for Py2 compatibility among other things). It may be simpler to write your own reader, or use the `csv` module, than to fight it. Make sure your `structured array` `dtype` uses unicode string type. – hpaulj Feb 11 '16 at 19:30
-
Possibly related: http://stackoverflow.com/q/33001373/190597 – unutbu Feb 11 '16 at 19:32
-
The csv module would definitely come in handy but the columns are tab-separated. When you say "write my own reader" you mean base it on standard python file I/O? – AutomEng Feb 11 '16 at 19:36
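(For reference, the csv module handles tab-separated columns as well; the delimiter just has to be set explicitly. A minimal sketch, with the filename and encoding as assumptions rather than anything from the question:

import csv
import numpy as np

# csv.reader with an explicit tab delimiter; open with a known encoding
with open('uni_csv.txt', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f, delimiter='\t'))

arr = np.array(rows)  # unicode string dtype, e.g. '<U2', is inferred
)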
1 Answer
In addition to the link that @unutbu found (using decode/encode in genfromtxt), here's a quick sketch of a direct file reader:
Sample file (UTF-8):
é, Ä
é, Ä
é, Ä
Read the lines, split, and pass through np.array:
In [327]: fn='uni_csv.txt'
In [328]: with open(fn) as f:lines=f.readlines()
In [329]: lines
Out[329]: ['é, Ä\n', 'é, Ä\n', 'é, Ä\n']
...
In [331]: [l.strip().split(',') for l in lines]
Out[331]: [['é', ' Ä'], ['é', ' Ä'], ['é', ' Ä']]
In [332]: np.array([l.strip().split(',') for l in lines])
Out[332]:
array([['é', ' Ä'],
['é', ' Ä'],
['é', ' Ä']],
dtype='<U2')
I don't think tab separation poses a problem (except that my text editor is set to replace tabs with spaces).
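If the file really is tab-separated, only the split argument changes; a quick sketch along the same lines:

# same readlines approach, just splitting on tabs instead of commas
with open(fn) as f:
    lines = f.readlines()
arr = np.array([l.strip().split('\t') for l in lines])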
For mixed datatypes, I need to add a tuple conversion (a structured array definition requires a list of tuples):
In [343]: with open(fn) as f:lines=f.readlines()
In [344]: dt=np.dtype([('int',int),('é','|U2'),('Ä','U5')])
In [345]: np.array([tuple(l.strip().split(',')) for l in lines], dt)
Out[345]:
array([(1, ' é', ' Ä'), (2, ' é', ' Ä'), (3, ' é', ' Ä')],
dtype=[('int', '<i4'), ('é', '<U2'), ('Ä', '<U5')])
(I added an integer column to my text file)
Actually loadtxt doesn't choke on this file and dtype either; it just loads the strings wrong.
In [349]: np.loadtxt('uni_csv.txt',dtype=dt, delimiter=',')
Out[349]:
array([(1, "b'", "b' \\x"), (2, "b'", "b' \\x"), (3, "b'", "b' \\x")],
dtype=[('int', '<i4'), ('é', '<U2'), ('Ä', '<U5')])
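For completeness, the decode route from the linked question looks roughly like this; a sketch assuming an older numpy where genfromtxt hands byte strings to the converters, with column indices for the hypothetical 3-column file above:

# decode each string column's bytes to unicode before they land in the array
decode = lambda b: b.decode('utf-8').strip()
dt = np.dtype([('int', int), ('é', 'U2'), ('Ä', 'U5')])
arr = np.genfromtxt('uni_csv.txt', delimiter=',', dtype=dt,
                    converters={1: decode, 2: decode})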

hpaulj
-
The interpreter spits out the following problem `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6908: ordinal not in range(128)` which means it still can't read certain characters in the file. My problem is that the source file is encoded in UTF-8. It shouldn't, in theory, cause such problems, right? – AutomEng Feb 11 '16 at 19:59
-
Neither am I (it probably finds another non-ASCII character somewhere in the text) but apparently opening the file with `open("outfile.sql", encoding="utf-8")` and appending each line to a list for further processing works. It's a clunky workaround in my opinion but it will have to do. – AutomEng Feb 11 '16 at 20:09
-
I'm upvoting your answer for the handling of the different data types though, thanks for that. – AutomEng Feb 11 '16 at 20:12
-
Your answer worked by adding the optional `encoding=`. In other words the initial call to the file handler would be: `with open(fn, encoding="utf-8") as f:lines=f.readlines()` – AutomEng Feb 12 '16 at 03:15
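Folding that fix back into the answer's reader, the working version looks roughly like this (same hypothetical file layout as above):

import numpy as np

# the explicit encoding is what avoids the ascii-codec UnicodeDecodeError
with open('uni_csv.txt', encoding='utf-8') as f:
    lines = f.readlines()

dt = np.dtype([('int', int), ('é', 'U2'), ('Ä', 'U5')])
arr = np.array([tuple(l.strip().split(',')) for l in lines], dt)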