0

I have a pandas DataFrame df that looks like:

df = 

sample   col1   data_value  time_stamp
A          1     15           0.5
A          1     45           0.5
A          1     32           0.5
A          2     3             1
A          2     57            1
A          2     89            1
B          1     10           0.5
B          1     20           0.5
B          1     30           0.5 
B          2     12            1
B          2     24            1
B          2     36            1

For a given sample and its respective column, I am trying to condense all data values into a numpy array in a new column merged_data to look like:

sample   col1   merged_data      time_stamp
A          1     [15, 45, 32]        0.5
A          2     [3, 57, 89]          1
B          1     [10, 20, 30]        0.5
B          2     [12, 24, 36]         1

I've tried using df['merged_data] = df.to_numpy() operations and df['merged_data'] = np.array(df.iloc[0:2, :].to_numpy(), but they don't work. All elements in the merged_data column need to be numpy arrays or lists (can easily convert between the two).

Lastly, I need to retain the time_stamp column for each combination of sample and col. How can I include this with the groupby?

Any help or thoughts would be greatly appreciated!

MAtennis9
  • 117
  • 7

2 Answers2

2

you can do this :

df = df.groupby(['sample','col1'], as_index=False)['data_value'].agg(list)

output:

>>
  sample  col1    data_value
0      A     1  [15, 45, 32]
1      A     2   [3, 57, 89]
2      B     1  [10, 20, 30]
3      B     2  [12, 24, 36]
eshirvana
  • 23,227
  • 3
  • 22
  • 38
  • 1
    `df.groupby(['sample','col1'], as_index=False)['data_value'].agg(list)` avoids to `reset_index` ;) – mozway Sep 21 '22 at 03:27
2

If the number of values in each group is identical, you can use:

import numpy as np
a = np.vstack(df.groupby(['sample','col1'])['data_value'].agg(list))

Or:

a = (df
 .assign(col=lambda d: d.groupby(['sample', 'col1']).cumcount())
 .pivot(['sample', 'col1'], 'col', 'data_value')
 .to_numpy()
)

output:

array([[15, 45, 32],
       [ 3, 57, 89],
       [10, 20, 30],
       [12, 24, 36]])
mozway
  • 194,879
  • 13
  • 39
  • 75
  • I updated the question to be slightly different. How do I retain the `time_stamp` column in the final output grouped by `sample` and `col1'? – MAtennis9 Sep 21 '22 at 04:08
  • `df.groupby(['sample','col1'], as_index=False).agg({'data_value': list, 'time_stamp': 'first'})`, but also already covered in other questions ;) – mozway Sep 21 '22 at 04:13