I have imported a text file into a numpy array as shown below.

data = np.genfromtxt(f, dtype=None, delimiter=',', names=None)

where f contains the path of my CSV file.
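
For anyone trying to reproduce this, here is a minimal self-contained sketch (the two sample rows and the StringIO stand-in for the file are mine; on Python 3 with recent numpy the string column may come back as '<U19' instead of 'S19'):

import io
import numpy as np

# Hypothetical stand-in for the real file: two rows in the same format
f = io.StringIO(
    "534,116.48482,39.89821,2008-02-03 00:00:49\n"
    "650,116.4978,39.98097,2008-02-03 00:00:02\n"
)
data = np.genfromtxt(f, dtype=None, delimiter=',', names=None)
print(data.dtype)  # per-column dtypes are inferred because dtype=None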

Now data contains the following:

array([(534, 116.48482, 39.89821, '2008-02-03 00:00:49'),
       (650, 116.4978, 39.98097, '2008-02-03 00:00:02'),
       (675, 116.31873, 39.9374, '2008-02-03 00:00:04'),
       (715, 116.70027, 40.16545, '2008-02-03 00:00:45'),
       (2884, 116.67727, 39.88201, '2008-02-03 00:00:48'),
       (3799, 116.29838, 40.04533, '2008-02-03 00:00:37'),
       (4549, 116.48405, 39.91403, '2008-02-03 00:00:42'),
       (4819, 116.42967, 39.93963, '2008-02-03 00:00:43')],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8'), ('f3', 'S19')])

If I now try to slice a column, i.e. extract the first or the second column, using

data[:,0]

It says "too many indices". I figured out that it is due the the way it is being stored. all the rows are being stored as tuples and not as list/array. I thought of using the "ugliest" way possible to perform slicing without having to use iteration. That would be to convert the tuples in each row to list and put it back to the numpy array. something like this

data=np.asarray([list(i) for i in data])

But with the above approach I lose the datatypes of the columns: every element is stored as a string, rather than the integer or float that was automatically detected in the former case.
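
For illustration, checking the dtype after that conversion makes the loss visible (the exact string length in the dtype depends on the longest field):

converted = np.asarray([list(i) for i in data])
print(converted.dtype)   # something like dtype('<U19'): everything coerced to strings
print(converted[:, 0])   # column slicing works now, but yields strings, not integers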

Now, is there any way to slice the columns without having to use iteration?

– user2179627

2 Answers

What np.genfromtxt has created for you is not an array of tuples, which would have had object dtype, but a record array. You can tell from the weird dtype:

dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8'), ('f3', 'S19')]

Each of the tuples in that list holds the name of the corresponding field and its dtype: <i4 is a little-endian 4-byte integer, <f8 a little-endian 8-byte float, and S19 a 19-character string. You can access the fields by name:

In [2]: x['f0']
Out[2]: array([ 534,  650,  675,  715, 2884, 3799, 4549, 4819])

In [3]: x['f1']
Out[3]: 
array([ 116.48482,  116.4978 ,  116.31873,  116.70027,  116.67727,
        116.29838,  116.48405,  116.42967])
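
If positional slicing over the numeric columns is really needed, one workaround (a sketch, not part of the record-array API itself) is to stack the named fields into a plain 2-D array; note that the integer column gets upcast to float:

cols = np.column_stack((x['f0'], x['f1'], x['f2']))  # shape (8, 3), dtype float64
cols[:, 0]  # array([ 534.,  650.,  675.,  715., 2884., 3799., 4549., 4819.])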

– Jaime

Perhaps for your case you could just use zip.

import numpy as np

x = np.array([(534, 116.48482, 39.89821, '2008-02-03 00:00:49'),
              (650, 116.4978, 39.98097, '2008-02-03 00:00:02'),
              (675, 116.31873, 39.9374, '2008-02-03 00:00:04'),
              (715, 116.70027, 40.16545, '2008-02-03 00:00:45'),
              (2884, 116.67727, 39.88201, '2008-02-03 00:00:48'),
              (3799, 116.29838, 40.04533, '2008-02-03 00:00:37'),
              (4549, 116.48405, 39.91403, '2008-02-03 00:00:42'),
              (4819, 116.42967, 39.93963, '2008-02-03 00:00:43')],
              dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8'), ('f3', 'S19')])

b = list(zip(*x))  # list() so indexing works on Python 3, where zip returns an iterator

Result:

>>> b[0]
(534, 650, 675, 715, 2884, 3799, 4549, 4819)
>>> b[1]
(116.48482, 116.4978, 116.31873, 116.70027, 116.67726999999999, 116.29837999999999, 116.48405, 116.42967)
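
If ndarrays are preferred over plain tuples, each zipped column can be wrapped back up (a small sketch on top of the above; numpy infers the dtype from the tuple):

col_id = np.array(b[0])   # integer array of the first column
col_lon = np.array(b[1])  # float array of the second column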

– Akavall
    What he has is a [record array](http://docs.scipy.org/doc/numpy/user/basics.rec.html), so it does have different dtypes. – Jaime Apr 21 '13 at 21:17
  • @Jaime, My whole comment about `numpy.arrays` not being able to hold different datatypes is wrong, and `record array` is not the only case. For example `a = np.array([{'a' : 5}, 'g', 5])` will hold 3 different datatypes (or technically `dtype=object`). So the best I could do is delete that statement. Thanks for pointing me in the right direction! As for the rest of my answer, well your answer is obviously better. – Akavall Apr 22 '13 at 00:03