
I have a dataframe with a column containing around 500,000 rows of array vectors. What I'm trying to do is unload the content of this column into a 2-dimensional array, but I don't know the fastest way to do it.

This is the format of the array I'm trying to obtain ([1, 2], [3, 4] and [5, 6] are arrays contained in my dataframe):

array([[1, 2],
       [3, 4],
       [5, 6]])

I tried to_numpy, as_matrix, and .values, but they give me a 1D array, which is not what I'm looking for:

array([array([1, 2]),
       array([3, 4]),
       array([5, 6])])

The only methods which gave me the result I want are np.asarray() and np.array() but they take too much time in my case.

What I want is the same array I obtain with the numpy methods (vectors 1, 2 and 8 in the code below), but faster if possible, because it takes too much time when there is a lot of data.

Thank you for your help!

Edit: Here is my function. It takes as a parameter a dataframe which contains two columns: id, and vectors, which is a Series of array objects.

 id      vectors
  1      array([1,2,3], dtype='float32')
  2      array([3,4,5], dtype='float32')
  3      array([6,7,8], dtype='float32')

[11530 rows x 2 columns]
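
For reference, here is a minimal sketch of how a dataframe of this shape could be built for testing; the row count and vector length are taken from the shapes reported further down, and the values themselves are made up:

import numpy as np
import pandas as pd

n_rows, dim = 11530, 300  # sizes taken from the shape reported below: (11530, 300)
df = pd.DataFrame({
    'id': range(1, n_rows + 1),
    'vectors': [np.random.rand(dim).astype('float32') for _ in range(n_rows)],
})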

What I want to do with this function is unload the content of the id column into a list, which is fast and easy, and the content of the vectors column into an array. So I want a 2-dimensional array of the array vectors.

import time
import logging

import numpy as np

logger = logging.getLogger(__name__)


def filter_df(df, request):
    # request is not used here; it is kept for the caller's signature
    start = time.time()
    filtered_df = df
    ids = filtered_df['id'].tolist()

    filtered_df_vectors = filtered_df['vectors'].tolist()

    # numpy conversions: build the desired 2D float32 array but take ~0.7 s
    vectors1 = np.asarray(filtered_df_vectors)
    vectors2 = np.array([f for f in filtered_df_vectors], dtype=np.float32)
    vectors8 = np.array(filtered_df['vectors'].values.tolist())
    vectors9 = np.array(filtered_df['vectors'].tolist())

    # pandas conversions: fast (~0.05 s) but return a 1D object array of arrays
    vectors3 = filtered_df['vectors'].as_matrix()  # deprecated in recent pandas
    vectors4 = filtered_df['vectors'].to_numpy()
    vectors5 = filtered_df['vectors'].values
    vectors6 = filtered_df.iloc[:, -1].values

    filter_duration = time.time() - start
    logger.info(f"duration: {filter_duration}s")
    return ids, vectors2, filter_duration
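
For context, calling it looks like this in a small sketch (the request argument is not used inside the function, so any placeholder value works here):

ids, vectors, duration = filter_df(df, request=None)
print(len(ids), vectors.shape, duration)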

I can't copy-paste the exact output it returns for the resulting arrays because it would be unreadable, so I will just show the two types of array I obtain with the methods I tested.

For vectors 1, 2, 8 and 9, where I use numpy methods, I obtain this format, which is the one I'm looking for, but it takes too much time (around 0.7 seconds, which is too slow for my case). Just know that [1,2,3] here stands in for one of my 300-element vectors. Here is what I obtain:

array([[1,2,3],
      [4,5,6],
      [7,8,9]], dtype=float32)

ndim : 2
dtype('float32')
shape : (11530, 300)
size : 3459000

For vectors 3, 4, 5 and 6, where I use pandas methods like to_numpy or as_matrix instead of numpy, the conversion is fast (~0.05 s), but with the same input it returns an array of this form:

array([array([1,2,3], dtype=float32),
       array([4,5,6], dtype=float32),
       array([7,8,9], dtype=float32)], dtype=object)

ndim : 1
dtype('O')
shape : (11530,)
size : 11530

I don't understand why these don't give me the same array as the numpy methods give me.
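
Here is a small standalone snippet that reproduces both behaviours with made-up 3-element vectors, in case it helps:

import numpy as np
import pandas as pd

s = pd.Series([np.array([1, 2, 3], dtype='float32'),
               np.array([4, 5, 6], dtype='float32'),
               np.array([7, 8, 9], dtype='float32')])

print(s.to_numpy().shape, s.to_numpy().dtype)  # (3,) object -> 1D array of arrays
print(np.array(s.tolist()).shape)              # (3, 3)      -> the desired 2D array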

nipato
  • how about `df.values`? – Quang Hoang May 29 '19 at 13:59
  • If `to_numpy` doesn't work, use `np.array(df['vector'].tolist())` or `np.array(df['vector'].map(list).tolist())`. – cs95 May 29 '19 at 14:03
  • df.values returns me the same format as the second one which is not wanted. I already tried working with np.array but it takes too much time to create the array in my case because there is a lot of data. Do you think there is a faster way to do it ? – nipato May 29 '19 at 14:19
  • Could you add minimal representative sample dataframe and expected output? – Divakar May 29 '19 at 14:37
  • @cs95 should this maybe be reopened, since OP noted that he tried the solutions which are provided in the linked answer? – Erfan May 29 '19 at 14:51
  • The dataframe apparently is using object dtype to store many 2 element arrays. `stack(series.values)` is best you'll get. – hpaulj May 29 '19 at 14:54
  • @Erfan I will be happy to if op can provide a [mcve] explaining why the current solutions don't work. My first comment also has other options they haven't tried yet. – cs95 May 29 '19 at 14:55
  • How is the duplicate relevant? He's already trying values and to_numpy. – hpaulj May 29 '19 at 14:56
  • I reopened this. You should be clearer about what did work, even if it is too slow for your taste. And a small sample to be sure we are on the same page. But if my guess is right, converting a large array (or series) of small arrays to one array is going to take time (generating that series probably took time as well). Thousands of 2 element arrays is not an efficient data structure! – hpaulj May 29 '19 at 15:42
  • I edited my answer with more details – nipato May 31 '19 at 10:12
  • The fastest way would be having each component of the vectors in one column. Is there a reason you can't have that? I agree with @hpaulj that probably anything else will be way slower. – Stop harming Monica May 31 '19 at 10:42
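
For reference, a minimal sketch of the stack(series.values) approach hpaulj suggests above (whether it actually beats np.array(tolist()) on 500,000 rows is not measured here):

import numpy as np
import pandas as pd

s = pd.Series([np.array([1, 2, 3], dtype='float32'),
               np.array([4, 5, 6], dtype='float32'),
               np.array([7, 8, 9], dtype='float32')])

vectors = np.stack(s.values)  # builds one (3, 3) float32 array from the per-row arrays
print(vectors.shape, vectors.dtype)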

0 Answers