I have a pandas DataFrame with a column of tuples, each made up of two floats, e.g. (1.1, 2.2). I want to produce an array containing the first element of each tuple. I could step through each row and take the first element of each tuple, but the DataFrame contains almost 4 million records and that approach is very slow. An answer by satoru on SO (stackoverflow.com/questions/6454894/reference-an-element-in-a-list-of-tuples) suggests the following mechanism:
>>> import numpy as np
>>> arr = np.array([(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8)])
>>> arr
array([[ 1.1,  2.2],
       [ 3.3,  4.4],
       [ 5.5,  6.6],
       [ 7.7,  8.8]])
>>> arr[:,0]
array([ 1.1, 3.3, 5.5, 7.7])
That works fine and would be perfect for my needs. However, the problem occurs when I try to create the numpy array from a pandas DataFrame column. In that case, the same indexing either raises an error or returns the wrong result. For example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'other':[0,0,0,1,1],'point':[(1.1,2.2),(3.3,4.4),(5.5,6.6),(7.7,8.8),(9.9,0.0)]})
>>> df
   other       point
0      0  (1.1, 2.2)
1      0  (3.3, 4.4)
2      0  (5.5, 6.6)
3      1  (7.7, 8.8)
4      1  (9.9, 0.0)
>>> arr2 = np.array(df['point'])
>>> arr2
array([(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)], dtype=object)
>>> arr2[:,0]
IndexError: too many indices for array
Alternatively:
>>> arr2 = np.array([df['point']])
>>> arr2
array([[[1.1, 2.2],
        [3.3, 4.4],
        [5.5, 6.6],
        [7.7, 8.8],
        [9.9, 0.0]]], dtype=object)
>>> arr2[:,0]
array([[1.1, 2.2]], dtype=object) # Which is not what I want!
Something seems to be going wrong when I transfer data from the pandas dataframe to a numpy array - but I've no idea what. Any suggestions would be gratefully received.
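One direction that looks promising is sketched below, under the assumption that every tuple in the column has exactly two elements. The difference between the two cases seems to be the shape: np.array on a list of tuples builds a 2-D float array, whereas np.array on the DataFrame column appears to keep each tuple as a single object in a 1-D array, so there is no second axis to index. Converting the column to a plain list first (Series.tolist()) should restore the 2-D behaviour. The names as_objects, as_floats and first are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'other': [0, 0, 0, 1, 1],
    'point': [(1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8), (9.9, 0.0)],
})

# Built straight from the column: a 1-D object array whose elements are tuples,
# which is why arr2[:, 0] raises "too many indices".
as_objects = np.array(df['point'])
print(as_objects.shape)

# Built from a plain list of tuples: numpy unpacks the tuples into a 2-D float array.
# Assumes all tuples have the same length.
as_floats = np.array(df['point'].tolist())
print(as_floats.shape)

# Column 0 now holds the first element of every tuple.
first = as_floats[:, 0]
print(first)
```

I have not timed this on the full 4 million rows, so whether it is fast enough there is still an open question.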