0

I have a simple df. It has two columns. I want to groupby the values based on column a. Here is a simple example: Any input would be greatly appreciated!

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 1, 2]
df['b'] = [10, 20, 30,40, 50,60]

Desired output is:

 df = pd.DataFrame()
 df['a'] = [1, 2, 3, 4]
 df['b'] = [10, 20, 30,40 ]
 df['b1'] = [50, 60, np.nan, np.nan ]

df

White cow
  • 85
  • 6
  • To some extent it depends on the structure of the data. Column 'a' has consecutive values, which is then partially repeated. Is this pattern likely to continue? Can you guarantee it? – njp Aug 09 '22 at 23:04
  • No. I only provide a minimal example here. The column a has random and repetetive values. – White cow Aug 09 '22 at 23:06

1 Answers1

1

Here's a way to do what you want. First you want to group by column 'a'. Normally groupby is used to calculate group aggregation functions:

df.groupby('a')['b'].mean()

but in this case we want to keep the values of b associated with each a. You can use

[(a,list(b)) for a,b in df.groupby('a')['b']]    

[(1, [10, 50]), (2, [20, 60]), (3, [30]), (4, [40])]

Conversion of this to a dataframe almost gets us there:

df2 = pd.DataFrame([(a,list(b)) for a,b in df.groupby('a')['b']],
                   columns=['a','temp'])

   a      temp
0  1  [10, 50]
1  2  [20, 60]
2  3      [30]
3  4      [40]

The column temp can be separated into different columns with to_list:

pd.DataFrame(df2['temp'].to_list())

    0     1
0  10  50.0
1  20  60.0
2  30   NaN
3  40   NaN

Rejoin the output dataframes:

df2.join(df3)

   a      temp   0     1
0  1  [10, 50]  10  50.0
1  2  [20, 60]  20  60.0
2  3      [30]  30   NaN
3  4      [40]  40   NaN

And clean up (remove temp column, rename columns and you probably want to do something about the integers cast to floats in the last column due to the NaNs)

I'm sure there's a cleaner way to do this but hopefully this gets you started!

njp
  • 620
  • 1
  • 3
  • 16
  • I assume here that the 50 goes into the second column (your `b1` column) as it is associated with the second appearance of a value (1) in column `a`. (Related to my original comment above) – njp Aug 10 '22 at 00:14