Numpy genfromtxt reads additional unwanted strings

Question

I want to read a txt file using numpy's genfromtxt. The file t.txt looks as follows:

###############
PSZ1 G096.89+24.17
PSZ1 G108.18−11.53
RXC J0225.1−2928
RXC J1053.7+5452
RXC J1234.2+0947
RXC J1314.4−2515
S 1081
ZwCl 0008.8+5215
ZwCl 2341+0000
1E 0657−558
1RXS J0603.3+4214
24P 73

I import numpy and run genfromtxt as follows:

import numpy as np
a = np.genfromtxt("t.txt", comments="#", dtype=None, autostrip=True, delimiter=" ")

and that returns the following when issuing print a:

array([['PSZ1', 'G096.89+24.17'],
       ['PSZ1', 'G108.18\xe2\x88\x9211.53'],
       ['RXC', 'J0225.1\xe2\x88\x922928'],
       ['RXC', 'J1053.7+5452'],
       ['RXC', 'J1234.2+0947'],
       ['RXC', 'J1314.4\xe2\x88\x922515'],
       ['S', '1081'],
       ['ZwCl', '0008.8+5215'],
       ['ZwCl', '2341+0000'],
       ['1E', '0657\xe2\x88\x92558'],
       ['1RXS', 'J0603.3+4214'],
       ['24P', '73']], 
      dtype='|S15')

I would like to know what causes the additional strings containing \x and how to get rid of them while still using genfromtxt.

Many other methods of reading strings show the same problem (the additional \x strings), even when I copy the example from this post (t.txt) directly into a txt or csv file.

I created the file t.txt in the Atom editor, which shows UTF8 in the bottom bar. I also saved the file again as UTF8.

How can I properly read the falsely encoded + and - signs in Python without changing them individually by hand?

Thanks

Possible duplicate of [Loading UTF-8 file in Python 3 using numpy.genfromtxt](http://stackoverflow.com/questions/33001373/loading-utf-8-file-in-python-3-using-numpy-genfromtxt) — Kennet Celeste, Oct 20 '16 at 16:03
I am using Python 2.7 and I am not receiving an error message. Also, there are no fancy letters in my txt file (as far as I can tell). — user3063903, Oct 20 '16 at 16:05
It looks like the minus sign is not being decoded and is instead shown as its UTF-8 byte sequence, "\xe2\x88\x92". It is probably related to how the UTF-8 file is loaded. There is no error, but the solution in the link provided by @yugi should help. — oxtay, Oct 20 '16 at 16:09
Ah ok, the minus is the problem. Thanks, I did not recognise that... — user3063903, Oct 20 '16 at 16:11
FYI: `'\xe2\x88\x92'` is the UTF-8 encoding of the [unicode character 'MINUS SIGN' (U+2212)](http://www.fileformat.info/info/unicode/char/2212/index.htm). That's not the regular minus sign '-', which is '\x2d'. — Warren Weckesser, Oct 20 '16 at 16:31
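
The distinction drawn in the last comment can be checked directly. A minimal Python 3 sketch using only the standard library (the variable names are illustrative, not from the post):

```python
import unicodedata

minus_sign = '\u2212'   # the character that ended up in t.txt
hyphen_minus = '-'      # the ordinary ASCII minus typed on a keyboard

# Official Unicode names of the two characters
print(unicodedata.name(minus_sign))     # MINUS SIGN
print(unicodedata.name(hyphen_minus))   # HYPHEN-MINUS

# UTF-8 byte sequences: three bytes versus one
print(minus_sign.encode('utf-8'))       # b'\xe2\x88\x92'
print(hyphen_minus.encode('utf-8'))     # b'-'
```

The two characters look nearly identical in an editor, which is why the file appears to contain nothing unusual.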

score 0 · Accepted Answer · edited May 23 '17 at 12:00

In a Py3 IPython session:

In [847]: data=np.genfromtxt('stack40159019.txt',comments='#',dtype=None)
In [848]: data
Out[848]: 
array([[b'PSZ1', b'G096.89+24.17'],
       [b'PSZ1', b'G108.18\xe2\x88\x9211.53'],
       [b'RXC', b'J0225.1\xe2\x88\x922928'],
       [b'RXC', b'J1053.7+5452'],
       [b'RXC', b'J1234.2+0947'],
       [b'RXC', b'J1314.4\xe2\x88\x922515'],
       [b'S', b'1081'],
       [b'ZwCl', b'0008.8+5215'],
       [b'ZwCl', b'2341+0000'],
       [b'1E', b'0657\xe2\x88\x92558'],
       [b'1RXS', b'J0603.3+4214'],
       [b'24P', b'73']], 
      dtype='|S15')
In [849]: np.char.decode(data)
Out[849]: 
array([['PSZ1', 'G096.89+24.17'],
       ['PSZ1', 'G108.18−11.53'],
       ['RXC', 'J0225.1−2928'],
       ['RXC', 'J1053.7+5452'],
       ['RXC', 'J1234.2+0947'],
       ['RXC', 'J1314.4−2515'],
       ['S', '1081'],
       ['ZwCl', '0008.8+5215'],
       ['ZwCl', '2341+0000'],
       ['1E', '0657−558'],
       ['1RXS', 'J0603.3+4214'],
       ['24P', '73']], 
      dtype='<U13')

The suggested duplicate is for Py3, but I think this decode approach will work on Py2 as well.

Loading UTF-8 file in Python 3 using numpy.genfromtxt

=====================

In a Py2 session:

>>> txt=b'G108.18\xe2\x88\x9211.53'
>>> txt
'G108.18\xe2\x88\x9211.53'
>>> txt.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)
>>> txt.decode('UTF-8')
u'G108.18\u221211.53'

>>> txt.decode(errors='replace')
u'G108.18\ufffd\ufffd\ufffd11.53'
>>> txt.decode(errors='ignore')
u'G108.1811.53'

and to replace the Unicode minus sign (U+2212) with the ASCII '-':

>>> '\xe2\x88\x92'.decode('utf8')
u'\u2212'
>>> txt.decode('utf8').replace(u'\u2212','-')
u'G108.18-11.53'
>>> txt.decode('utf8').replace(u'\u2212','-').encode()
'G108.18-11.53'
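
For comparison, here is a sketch of the same steps in Python 3, where `bytes.decode()` defaults to UTF-8 rather than ASCII, although naming the codec explicitly keeps the intent clear. The helper name `to_ascii_minus` is made up for illustration:

```python
def to_ascii_minus(raw):
    """Decode UTF-8 bytes and swap U+2212 MINUS SIGN for ASCII '-'."""
    text = raw.decode('utf-8')
    return text.replace('\u2212', '-')

txt = b'G108.18\xe2\x88\x9211.53'
cleaned = to_ascii_minus(txt)

print(cleaned)                   # G108.18-11.53
print(cleaned.encode('ascii'))   # b'G108.18-11.53'
```

Once the minus sign has been replaced, the string is pure ASCII, so encoding it back to bytes no longer raises.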

So with np.char.replace (back in Py3), where data1 is presumably the decoded array from np.char.decode(data):

In [872]: np.char.replace(data1,u'\u2212','-').astype('S13')
Out[872]: 
array([[b'PSZ1', b'G096.89+24.17'],
       [b'PSZ1', b'G108.18-11.53'],
       [b'RXC', b'J0225.1-2928'],
       [b'RXC', b'J1053.7+5452'],
       [b'RXC', b'J1234.2+0947'],
       [b'RXC', b'J1314.4-2515'],
       [b'S', b'1081'],
       [b'ZwCl', b'0008.8+5215'],
       [b'ZwCl', b'2341+0000'],
       [b'1E', b'0657-558'],
       [b'1RXS', b'J0603.3+4214'],
       [b'24P', b'73']], 
      dtype='|S13')
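
Putting the pieces together, here is a self-contained Python 3 sketch of the decode-then-replace steps on a small bytes array. The array is typed in by hand as a shortened stand-in for what genfromtxt returned above, so the example does not depend on t.txt. The names `decoded`, `cleaned`, and `as_bytes` are illustrative. The codec is passed to np.char.decode explicitly rather than relying on a default:

```python
import numpy as np

# Hand-written stand-in for the bytes array returned by genfromtxt
data = np.array([[b'PSZ1', b'G096.89+24.17'],
                 [b'PSZ1', b'G108.18\xe2\x88\x9211.53'],
                 [b'1E', b'0657\xe2\x88\x92558']], dtype='S15')

# Bytes -> str, naming the codec explicitly
decoded = np.char.decode(data, 'utf-8')

# Swap U+2212 MINUS SIGN for the ASCII '-'
cleaned = np.char.replace(decoded, '\u2212', '-')

# Optional: back to a bytes array, now safely pure ASCII
as_bytes = np.char.encode(cleaned, 'ascii')

print(cleaned)
print(as_bytes)
```

If a str array is acceptable, the final np.char.encode step can be skipped. NumPy 1.14 and later also accept an `encoding` argument in genfromtxt, which may avoid the bytes round trip. That is worth checking against the installed version.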

In Python 2.7 I get the following error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128) — user3063903, Oct 20 '16 at 18:46
I get that error when I try `data.astype('U20')`, but not with the `np.char` function. I wonder if an explicit `encoding` would help: `np.char.decode(data, 'UTF8')` — hpaulj, Oct 20 '16 at 19:36
