I have a dataframe with a column vector of around 500,000 rows, where each row holds an array. I want to unload the contents of this column into a 2-dimensional array, but I don't know the fastest way to do it.
This is the format of the array I'm trying to obtain ([1, 2], [3, 4] and [5, 6] are the arrays contained in my dataframe):
array([[1, 2],
       [3, 4],
       [5, 6]])
I tried to_numpy, as_matrix and .values, but they give me a 1D array of arrays, which is not what I'm looking for:
array([array([1, 2]),
       array([3, 4]),
       array([5, 6])])
The only methods that gave me the result I want are np.asarray() and np.array(), but they take too much time in my case.
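To make the difference concrete, here is a minimal sketch that reproduces both behaviours on a toy column. The names df_small, as_objects and as_2d are only illustrative and are not part of my real code:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column: every cell holds its own small ndarray.
df_small = pd.DataFrame(
    {"vector": [np.array([1, 2]), np.array([3, 4]), np.array([5, 6])]}
)

# pandas accessor: returns a 1-D array whose elements are ndarray objects.
as_objects = df_small["vector"].to_numpy()

# numpy constructor on a list of rows: builds one 2-D numeric array.
as_2d = np.array(df_small["vector"].tolist())

print(as_objects.shape, as_objects.dtype)
print(as_2d.shape, as_2d.dtype)
```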
What I want is the same array I get from the numpy methods (vectors 1, 2 and 8 in the function below), but faster if possible, because the conversion takes too long when there is a lot of data.
Thank you for your help!
Edit: Here is my function. It takes as a parameter a dataframe with two columns: id, and vectors, which is a Series of array objects.
id  vectors
1   array([1,2,3], dtype='float32')
2   array([3,4,5], dtype='float32')
3   array([6,7,8], dtype='float32')
[11530 rows x 2 columns]
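Since I can't share the real data, a synthetic frame with the same layout could be used to reproduce the timings. This is only a sketch: make_df is a hypothetical helper, the values are random, and the default sizes are taken from the shape (11530, 300) reported further down.

```python
import numpy as np
import pandas as pd


def make_df(n_rows=11530, dim=300, seed=0):
    """Build a frame shaped like the one above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # One float32 matrix, then split into rows so each cell holds a 1-D array.
    data = rng.random((n_rows, dim), dtype=np.float32)
    return pd.DataFrame(
        {"id": np.arange(1, n_rows + 1), "vectors": list(data)}
    )


df = make_df()
print(df.shape)             # two columns: id and vectors
print(df["vectors"].dtype)  # the vectors column is stored with object dtype
```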
What I want this function to do is unload the content of column id into a list, which is fast and easy, and the content of column vectors into an array. So I want a 2-dimensional array built from the vectors.
import logging
import time

import numpy as np

logger = logging.getLogger(__name__)


def filter_df(df, request):
    # request is not used in this excerpt
    start = time.time()
    filtered_df = df
    ids = filtered_df['id'].tolist()
    filtered_df_vectors = filtered_df['vectors'].tolist()
    # numpy constructors: 2D float32 result, but slow
    vectors1 = np.asarray(filtered_df_vectors)
    vectors2 = np.array([f for f in filtered_df_vectors], dtype=np.float32)
    # pandas accessors: fast, but 1D object array
    vectors3 = filtered_df['vectors'].as_matrix()  # no longer available as of pandas 1.0
    vectors4 = filtered_df['vectors'].to_numpy()
    vectors5 = filtered_df['vectors'].values
    vectors6 = filtered_df.iloc[:, -1].values
    # numpy constructors again, going through tolist()
    vectors8 = np.array(filtered_df['vectors'].values.tolist())
    vectors9 = np.array(filtered_df['vectors'].tolist())
    # this duration covers all of the conversions above together
    filter_duration = time.time() - start
    logger.info(f"duration: {filter_duration}s")
    return ids, vectors2, filter_duration
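The function above times all the conversions together, so here is a rough sketch of how each one could be timed on its own with timeit. time_conversions and the tiny demo frame are hypothetical; the numbers I quote in this post come from my real data, not from this snippet.

```python
import timeit

import numpy as np
import pandas as pd


def time_conversions(df, repeats=5):
    """Return the best wall-clock time in seconds for each conversion."""
    col = df["vectors"]
    candidates = {
        "np.array(col.tolist())": lambda: np.array(col.tolist()),
        "np.asarray(col.tolist())": lambda: np.asarray(col.tolist()),
        "col.to_numpy()": lambda: col.to_numpy(),
        "col.values": lambda: col.values,
    }
    # Take the minimum over several runs to reduce noise from other processes.
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeats))
        for name, fn in candidates.items()
    }


# Tiny illustrative frame; the real one has 11530 rows of 300 floats.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "vectors": [
        np.array([1, 2, 3], dtype="float32"),
        np.array([3, 4, 5], dtype="float32"),
        np.array([6, 7, 8], dtype="float32"),
    ],
})

for name, seconds in time_conversions(demo).items():
    print(f"{name}: {seconds:.6f}s")
```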
I can't paste the exact output for the resulting arrays because it would be unreadable, so I will just show the two kinds of array I obtain from the methods I tested.
For vectors 1, 2, 8 and 9, where I use numpy methods, I obtain the format I'm looking for, but it takes too much time (around 0.7 seconds, which is too slow for my case). Here [1,2,3] is a placeholder for a real row; as the shape below shows, each real vector has 300 elements. Here is what I obtain:
array([[1,2,3],
       [4,5,6],
       [7,8,9]], dtype=float32)
ndim : 2
dtype('float32')
shape : (11530, 300)
size : 3459000
For vectors 3, 4, 5 and 6, where I use pandas methods such as to_numpy or as_matrix instead of numpy, the conversion is fast (~0.05 sec), but with the same input it returns an array of this form:
array([array([1,2,3], dtype=float32),
       array([4,5,6], dtype=float32),
       array([7,8,9], dtype=float32)], dtype=object)
ndim : 1
dtype('O')
shape : (11530,)
size : 11530
I don't understand why these methods don't give me the same array as the numpy methods do.
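My current guess, sketched below, is that the column has object dtype (the dtype('O') above suggests this), so each cell is a separate ndarray and the pandas accessors simply hand back that 1D array of objects, while np.array has to copy every row into one new 2D block. The sketch also tries np.stack as one more candidate; I have not measured whether it is any faster on my data, so treat that part as an assumption to test rather than a solution.

```python
import numpy as np
import pandas as pd

rows = [
    np.array([1, 2, 3], dtype="float32"),
    np.array([4, 5, 6], dtype="float32"),
    np.array([7, 8, 9], dtype="float32"),
]
col = pd.Series(rows, name="vectors")

# The column itself is object dtype: one Python object (an ndarray) per cell.
print(col.dtype)

# to_numpy() therefore returns a 1-D object array of those per-row arrays.
cells = col.to_numpy()
print(cells.shape, type(cells[0]))

# np.stack copies the rows into a single new 2-D float32 array,
# giving the same result as np.array(col.tolist()).
stacked = np.stack(cells)
print(stacked.shape, stacked.dtype)
```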