I have a text file containing rows of mixed data (strings, integers and floats) separated by whitespace, e.g.

HIP893    23_10    7     0.028      4
HIP1074  43_20    20   0.0141    1
HIP1325  23_10    7     0.02388  5
...

I've imported this data using the following line:

data=np.genfromtxt('98_info.txt', dtype=(object, object, int,float,float))

However when I do this I get an output of

[(b'HIP893', b'23_10', 7, 0.028, 4) 
 (b'HIP1074', b'43_20', 20, 0.0141, 1)
 (b'HIP1325', b'23_10', 7, 0.02388, 5)
  ... ]

Whereas I would like there to be no 'b' and instead:

[('HIP893', '23_10', 7, 0.028, 4.0) 
 ('HIP1074', '43_20', 20, 0.0141, 1.0)
 ('HIP1325', '23_10', 7, 0.02388, 5.0)
  ... ]

I have tried NumPy's core.defchararray, but that gave me the error 'string operation on non-string array'. I guess that is because my data is a combination of strings and numbers?

Is there some way to either remove the character but keep the data in an array or perhaps another way to load in the information that will keep the strings in quotation marks and the numbers without them?

If there is a way to import it in that form as a 2d np array even better, but that is not an issue if not.

Thanks!

qwerty

3 Answers

You can pass converters= with a function that decodes your byte strings, e.g.:

convs = dict.fromkeys([0, 1], bytes.decode)
data = np.genfromtxt('98_info.txt', dtype=(object, object, int, float, float), converters=convs)

Which gives you data of:

array([('HIP893', '23_10',  7, 0.028  , 4.),
       ('HIP1074', '43_20', 20, 0.0141 , 1.),
       ('HIP1325', '23_10',  7, 0.02388, 5.)],
      dtype=[('f0', 'O'), ('f1', 'O'), ('f2', '<i8'), ('f3', '<f8'), ('f4', '<f8')])
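
A self-contained sketch of the same idea, for trying it without the original file. It assumes the three sample rows from the question, fed in through `io.StringIO`, and spells the dtype as the string `"O,O,i8,f8,f8"` instead of the tuple used above. The helper `to_text` is hypothetical: it passes `str` through unchanged, because whether `genfromtxt` hands a converter `bytes` or `str` depends on the NumPy version and the `encoding` argument.

```python
import io
import numpy as np

def to_text(value):
    # Decode bytes; leave str untouched (the input type varies with
    # the NumPy version and the encoding argument).
    return value.decode() if isinstance(value, bytes) else value

# Sample rows copied from the question, standing in for '98_info.txt'.
sample = io.StringIO(
    "HIP893    23_10    7     0.028      4\n"
    "HIP1074  43_20    20   0.0141    1\n"
    "HIP1325  23_10    7     0.02388  5\n"
)

# Apply the converter to the two text columns only.
convs = dict.fromkeys([0, 1], to_text)
data = np.genfromtxt(sample, dtype="O,O,i8,f8,f8", converters=convs)

print(data)
```

The numeric columns are left to `genfromtxt`, so they keep their `int` and `float` types.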
Jon Clements

With your sample and dtype:

In [1]: np.genfromtxt('stack55810419.txt', dtype=(object, object, int, float, float))
Out[1]: 
array([(b'HIP893', b'23_10',  7, 0.028  , 4.),
       (b'HIP1074', b'43_20', 20, 0.0141 , 1.),
       (b'HIP1325', b'23_10',  7, 0.02388, 5.)],
      dtype=[('f0', 'O'), ('f1', 'O'), ('f2', '<i8'), ('f3', '<f8'), ('f4', '<f8')])

With dtype=None (and encoding=None):

In [5]: np.genfromtxt('stack55810419.txt', dtype=None, encoding=None)           
Out[5]: 
array([('HIP893', 2310,  7, 0.028  , 4),
       ('HIP1074', 4320, 20, 0.0141 , 1),
       ('HIP1325', 2310,  7, 0.02388, 5)],
      dtype=[('f0', '<U7'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<f8'), ('f4', '<i8')])

Specifying unicode dtypes (have to include a size):

In [6]: np.genfromtxt('stack55810419.txt', dtype=('U7', 'U7', int,float,float)) 
Out[6]: 
array([('HIP893', '23_10',  7, 0.028  , 4.),
       ('HIP1074', '43_20', 20, 0.0141 , 1.),
       ('HIP1325', '23_10',  7, 0.02388, 5.)],
      dtype=[('f0', '<U7'), ('f1', '<U7'), ('f2', '<i8'), ('f3', '<f8'), ('f4', '<f8')])
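
A runnable sketch of the unicode-dtype approach, which also touches on the question's wish for a 2-D array. It assumes the sample rows from the question (read from `io.StringIO` rather than a file), writes the dtype as the string `"U7,U7,i8,f8,f8"`, and passes `encoding="utf-8"` explicitly. The structured result is 1-D with named fields; the `table` variable is an illustrative extra showing one way to get a 2-D object array from it.

```python
import io
import numpy as np

# Sample rows copied from the question.
sample = io.StringIO(
    "HIP893    23_10    7     0.028      4\n"
    "HIP1074  43_20    20   0.0141    1\n"
    "HIP1325  23_10    7     0.02388  5\n"
)

# Fixed-width unicode for the text columns; 'U7' fits the longest sample value.
data = np.genfromtxt(sample, dtype="U7,U7,i8,f8,f8", encoding="utf-8")

# A structured array is 1-D: one record per row, columns accessed by field name.
names = data["f0"]
ratios = data["f3"]

# A 2-D array needs a single dtype; object keeps each value's Python type.
table = np.array(data.tolist(), dtype=object)

print(data.shape, table.shape)
```

Strings longer than the declared width would be truncated, so the width should be chosen to fit the real file, not just these three rows.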

I'm puzzled as to why the None case chooses an integer dtype for the 2nd column (I would have expected the underscore to prevent that).

dtype=None without the encoding parameter raises this warning:

/usr/local/bin/ipython3:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.

In Py2 the default string type is bytestrings; in Py3 it is unicode. genfromtxt has used bytestrings for compatibility with Py2, and recent versions have added the encoding parameter, but there still seem to be some rough edges in that conversion.


This may be why I got i8; Python's own int accepts the underscore.

In [20]: int('23_10')                                                           
Out[20]: 2310
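
To illustrate why a token like `23_10` can end up as an integer, here is a rough sketch of a "try numeric first, fall back to string" guess. The function `guess_kind` is hypothetical and is not NumPy's actual inference code; it only assumes that the inference relies on Python's own `int` and `float`, which accept underscores between digits since Python 3.6.

```python
def guess_kind(token):
    # Try the narrowest numeric type first, then fall back to text.
    for cast in (int, float):
        try:
            cast(token)
            return cast.__name__
        except ValueError:
            pass
    return "str"

# Second-column value from the sample data: the underscore is accepted by int().
kinds = [guess_kind(t) for t in ("HIP893", "23_10", "7", "0.028")]
print(kinds)  # ['str', 'int', 'int', 'float']
```

If the underscore must be preserved, giving that column an explicit string dtype (as in the `U7` example) sidesteps the guess entirely.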
hpaulj
  • What is the conclusion of these observations? Are you saying that using `dtype='U7'` is the preferred way? – mkrieger1 Apr 24 '19 at 08:10

Strings shown with a b prefix are byte strings, i.e. encoded text.

You can decode them with the decode method (calling str on a bytes object without an encoding only returns its repr, b prefix included):

newData = [tuple(x.decode() if isinstance(x, bytes) else x for x in y) for y in data]
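
A small pure-Python check of how `str` and `decode` differ on a bytes object in Python 3. The `rows` list is made-up sample data mirroring the question's first two rows, not the real file.

```python
raw = b"HIP893"

# str() without an encoding returns the repr, so the b prefix survives.
as_repr = str(raw)
# decode() (or str(raw, "utf-8")) returns the actual text.
as_text = raw.decode()

# Decode only the bytes items in each row; numbers pass through unchanged.
rows = [(b"HIP893", b"23_10", 7, 0.028, 4.0), (b"HIP1074", b"43_20", 20, 0.0141, 1.0)]
decoded = [tuple(x.decode() if isinstance(x, bytes) else x for x in row) for row in rows]

print(as_repr, as_text, decoded[0])
```

The result is a plain list of tuples, so it would still need converting if a NumPy array is required.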

I think you can convert it to a NumPy array via this SO answer

I really don't know about nparray

vaku
  • This will end up with a `list` though (not an `np.array`) containing tuples where *everything* has been converted to strings - even the items that are supposed to be `int` or `float`. – Jon Clements Apr 23 '19 at 11:56