How to remove 'b' character from ndarray that is added by np.genfromtxt

Question

I have a text file containing rows of strings, integers and floats, separated by whitespace, e.g.

HIP893 23_10 7 0.028 4
HIP1074 43_20 20 0.0141 1
HIP1325 23_10 7 0.02388 5
...

I've imported this data using the following line:

data=np.genfromtxt('98_info.txt', dtype=(object, object, int,float,float))

However when I do this I get an output of

[(b'HIP893', b'23_10', 7, 0.028, 4) 
 (b'HIP1074', b'43_20', 20, 0.0141, 1)
 (b'HIP1325', b'23_10', 7, 0.02388, 5)
  ... ]

Whereas I would like there to be no 'b' and instead:

[('HIP893', '23_10', 7, 0.028, 4.0) 
 ('HIP1074', '43_20', 20, 0.0141, 1.0)
 ('HIP1325', '23_10', 7, 0.02388, 5.0)
  ... ]

I have tried NumPy's core.defchararray, but that gave me the error 'string operation on non-string array', presumably because my data mixes strings and numbers.

Is there some way to either remove the character but keep the data in an array or perhaps another way to load in the information that will keep the strings in quotation marks and the numbers without them?

If there is a way to import it in that form as a 2d np array even better, but that is not an issue if not.

Thanks!

the "b" character denotes a bytes sequence: https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal — David Zemens, Apr 23 '19 at 11:45
That's just the representation of the data, not part of the data itself. — chepner, Apr 23 '19 at 16:27

score 3 · Answer 1 · answered Apr 23 '19 at 11:54

You can pass converters= with a function that decodes your byte strings, e.g.:

convs = dict.fromkeys([0, 1], bytes.decode)
data = np.genfromtxt('98_info.txt', dtype=(object, object, int, float, float), converters=convs)

Which gives you data of:

array([('HIP893', '23_10',  7, 0.028  , 4.),
       ('HIP1074', '43_20', 20, 0.0141 , 1.),
       ('HIP1325', '23_10',  7, 0.02388, 5.)],
      dtype=[('f0', 'O'), ('f1', 'O'), ('f2', '<i8'), ('f3', '<f8'), ('f4', '<f8')])

hpaulj · Accepted Answer · 2019-04-23T16:25:51.657

With your sample and dtype:

In [1]: np.genfromtxt('stack55810419.txt', dtype=(object, object, int,float,float))
Out[1]: 
array([(b'HIP893', b'23_10',  7, 0.028  , 4.),
       (b'HIP1074', b'43_20', 20, 0.0141 , 1.),
       (b'HIP1325', b'23_10',  7, 0.02388, 5.)],
      dtype=[('f0', 'O'), ('f1', 'O'), ('f2', '<i8'), ('f3', '<f8'), ('f4', '<f8')])

With dtype=None (and encoding=None):

In [5]: np.genfromtxt('stack55810419.txt', dtype=None, encoding=None)           
Out[5]: 
array([('HIP893', 2310,  7, 0.028  , 4),
       ('HIP1074', 4320, 20, 0.0141 , 1),
       ('HIP1325', 2310,  7, 0.02388, 5)],
      dtype=[('f0', '<U7'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<f8'), ('f4', '<i8')])

Specifying unicode dtypes (have to include a size):

In [6]: np.genfromtxt('stack55810419.txt', dtype=('U7', 'U7', int,float,float)) 
Out[6]: 
array([('HIP893', '23_10',  7, 0.028  , 4.),
       ('HIP1074', '43_20', 20, 0.0141 , 1.),
       ('HIP1325', '23_10',  7, 0.02388, 5.)],
      dtype=[('f0', '<U7'), ('f1', '<U7'), ('f2', '<i8'), ('f3', '<f8'), ('f4', '<f8')])
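
The unicode-dtype approach can be tried without the original file. The following is a minimal sketch, assuming a NumPy version that has the encoding parameter. It feeds the three sample rows through io.StringIO as a stand-in for the text file, passes encoding='utf-8' explicitly so the result does not depend on the version's default, and writes the dtype as a comma-separated string (with i8/f8 standing in for int/float), which yields the same f0 to f4 field names.

```python
import io
import numpy as np

# Stand-in for the contents of the text file (the three sample rows from the question).
sample = io.StringIO(
    "HIP893 23_10 7 0.028 4\n"
    "HIP1074 43_20 20 0.0141 1\n"
    "HIP1325 23_10 7 0.02388 5\n"
)

# 'U7' holds up to 7 unicode characters, so the width must cover
# the longest entry in the column ('HIP1074' here).
data = np.genfromtxt(sample, dtype="U7,U7,i8,f8,f8", encoding="utf-8")

print(data["f0"])  # string column, no b prefix
print(data["f4"])  # float column
```

Fields are accessed by name (data["f0"], data["f1"], ...), since the result is a 1-D structured array rather than a 2-D array.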

I'm puzzled as to why the None case chooses an integer dtype for the 2nd column (the underscore should have prevented that).

dtype=None without the encoding parameter raises this warning:

/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.

In Py2 the default string type is bytestrings; in Py3 it is unicode. genfromtxt has used bytestrings for compatibility with Py2, and recent versions have added the encoding parameter, but there still seem to be some rough edges in that conversion.

This may be why I got i8; Python's own int accepts the underscore.

In [20]: int('23_10')                                                           
Out[20]: 2310
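
To illustrate why an underscore does not stop a value from being read as an integer, here is a simplified sketch of try-the-narrowest-type-first inference. infer_type is a hypothetical helper written for this illustration, not genfromtxt's actual code. The assumption is only that the inference ends up calling Python's int(), which accepts single underscores between digits since Python 3.6.

```python
def infer_type(token):
    """Return the first of int, float, str that can parse `token` (hypothetical helper)."""
    for candidate in (int, float):
        try:
            candidate(token)
            return candidate
        except ValueError:
            continue
    # Neither numeric type could parse the token, so treat it as text.
    return str


for token in ("23_10", "0.028", "HIP893"):
    print(token, infer_type(token).__name__)
```

Under this scheme '23_10' is classified as int before str is ever considered, which matches the i8 column seen above.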

What is the conclusion of these observations? Are you saying that using `dtype='U7'` is the preferred way? — mkrieger1, Apr 24 '19 at 08:10

vaku · Answer 3 · 2019-04-23T12:10:01.363

0

Strings prefixed with b are encoded strings, i.e. bytes.

You can decode them with the decode method (plain str(x) on a bytes object returns its repr, b prefix included):

newData = [tuple(x.decode() if isinstance(x, bytes) else x for x in y) for y in data]

I think you can convert it to an ndarray via this SO answer

I really don't know much about ndarray
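
For reference, a small sketch of decoding the bytes fields and getting back to a structured array, assuming Python 3. The array below is built by hand to mimic the genfromtxt output from the question, and the target dtype (field names f0 to f4, width U7) is an assumption chosen to match the sample rows.

```python
import numpy as np

# Hand-built stand-in for the genfromtxt result shown in the question.
data = np.array(
    [(b"HIP893", b"23_10", 7, 0.028, 4.0),
     (b"HIP1074", b"43_20", 20, 0.0141, 1.0),
     (b"HIP1325", b"23_10", 7, 0.02388, 5.0)],
    dtype=[("f0", "O"), ("f1", "O"), ("f2", "i8"), ("f3", "f8"), ("f4", "f8")],
)

# str() on bytes gives the repr (b prefix included); decode() gives the text.
print(str(b"HIP893"), b"HIP893".decode())

# Decode only the bytes fields; numbers are left untouched.
rows = [
    tuple(x.decode() if isinstance(x, bytes) else x for x in row)
    for row in data.tolist()
]

# Rebuild a structured array, now with unicode string fields.
decoded = np.array(
    rows,
    dtype=[("f0", "U7"), ("f1", "U7"), ("f2", "i8"), ("f3", "f8"), ("f4", "f8")],
)
print(decoded)
```

The result is still a 1-D structured array rather than a 2-D array, because the columns have different types.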

edited Apr 23 '19 at 12:10

answered Apr 23 '19 at 11:46


This will end up with a `list` though (not an `np.array`) containing tuples where *everything* has been converted to strings - even the items that are supposed to be `int` or `float`. – Jon Clements Apr 23 '19 at 11:56
