I have a pandas dataframe that has a column that contains tuples made up of two floats e.g. (1.1,2.2). I want to be able to produce an array that contains the first element of each tuple. I could step through each row and get the first element of each tuple but the dataframe contains almost 4 million records and such an approach is very slow. An answer by satoru on SO (stackoverflow.com/questions/6454894/reference-an-element-in-a-list-of-tuples) suggests using the following mechanism:

>>> import numpy as np
>>> arr = np.array([(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8)])
>>> arr
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8]])
>>> arr[:,0]
array([ 1.1,  3.3,  5.5,  7.7])

So that works fine and would be absolutely perfect for my needs. However, the problem I have occurs when I try to create a numpy array from a pandas dataframe. In that case, the above solution fails with a variety of errors. For example:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})
>>> df
   other       point
0      0  (1.1, 2.2)
1      0  (3.3, 4.4)
2      0  (5.5, 6.6)
3      1  (7.7, 8.8)
4      1  (9.9, 0.0)
>>> arr2 = np.array(df['point'])
>>> arr2
array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)
>>> arr2[:,0]
IndexError: too many indices for array

Alternatively:

>>> arr2 = np.array([df['point']])
>>> arr2
array([[[1.1, 2.2],
        [3.3, 4.4],
        [5.5, 6.6],
        [7.7, 8.8],
        [9.9, 0.0]]], dtype=object)
>>> arr2[:,0]
array([[1.1, 2.2]], dtype=object)   # Which is not what I want!

Something seems to be going wrong when I transfer data from the pandas dataframe to a numpy array - but I've no idea what. Any suggestions would be gratefully received.

user1718097

2 Answers

Starting with your dataframe, I can extract a (5,2) array with:

In [68]: df=pandas.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})

In [69]: np.array(df['point'].tolist())
Out[69]: 
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8],
       [ 9.9,  0. ]])

df['point'] is a Pandas series.

df['point'].values returns an array of shape (5,) and dtype object:

array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)

It is, in effect, an array of tuples: real Python tuples, not the tuple lookalikes that a structured array displays. The array actually contains pointers to the tuples, which are stored elsewhere in memory. Its shape is (5,); it is a 1d array, so trying to index it as though it were 2d gives the 'too many indices' error. np.array([df['point']]) just wraps it in another dimension, without addressing the fundamental object dtype issue.
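
To make the shape problem concrete, here is a minimal sketch that builds the same kind of 1d object array by hand. The names `points`, `obj` and `raised` are only illustrative, and the hand-built array is assumed to be a stand-in for what df['point'].values returns.

```python
import numpy as np

points = [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)]

# Build a 1d object array one slot at a time, so that each slot holds a
# reference to an ordinary Python tuple (assumed stand-in for df['point'].values).
obj = np.empty(len(points), dtype=object)
for i, p in enumerate(points):
    obj[i] = p

print(obj.shape, obj.ndim, obj.dtype)
print(type(obj[0]))

# There is only one axis, so a second index is rejected.
try:
    obj[:, 0]
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```

NumPy sees five opaque Python objects here, not a 5x2 block of floats, which is why the column-style index has nothing to select from.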

tolist() converts it to a list of tuples, from which you can build the 2d array.
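
Putting that together for the original goal (the first element of every tuple), a sketch along these lines should work. It reuses the question's example dataframe; `arr` and `first` are just illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'other': [0, 0, 0, 1, 1],
    'point': [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)],
})

# A list of equal-length tuples becomes a regular 2d float array.
arr = np.array(df['point'].tolist())

# With a real second axis, column 0 is the first element of each tuple.
first = arr[:, 0]

print(arr.shape, arr.dtype)
print(first)
```

This relies on every tuple having the same length. If the lengths differ, NumPy cannot form a rectangular array from the list.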

Converting an array of objects to an n-d numeric array is not trivial and invariably requires some sort of copying. The data buffers are entirely different, so things like astype don't work.
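
As a rough illustration of the astype point, the sketch below tries the conversion on a hand-built object array and then falls back to rebuilding the array from a list. The variable names are hypothetical, and the exact exception type and message may differ between NumPy versions.

```python
import numpy as np

points = [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6)]

# 1d object array whose elements are Python tuples.
obj = np.empty(len(points), dtype=object)
for i, p in enumerate(points):
    obj[i] = p

# astype converts element by element, and a tuple is not a single float.
try:
    obj.astype(float)
    astype_worked = True
except (TypeError, ValueError) as exc:
    astype_worked = False
    print("astype failed:", exc)

# Rebuilding from a list of tuples copies the values into a new float buffer.
rebuilt = np.array(obj.tolist())
print(rebuilt.shape, rebuilt.dtype)
```

The rebuilt array is a fresh copy, so later changes to the dataframe column are not reflected in it.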

hpaulj

import numpy as np
import pandas as pd
df = pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})
# Take the first element of each tuple, then get the underlying NumPy array.
array = df['point'].apply(lambda x: x[0]).values
array
# array([ 1.1,  3.3,  5.5,  7.7,  9.9])

grechut
  • Thanks for that solution. That would certainly produce the desired output. However, it doesn't really address the question as to why importing data from a dataframe into a numpy array doesn't work. – user1718097 Mar 25 '15 at 00:23