Group by the column in df python

Question

I have a simple df. It has two columns. I want to groupby the values based on column a. Here is a simple example: Any input would be greatly appreciated!

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 1, 2]
df['b'] = [10, 20, 30,40, 50,60]

Desired output is:

 df = pd.DataFrame()
 df['a'] = [1, 2, 3, 4]
 df['b'] = [10, 20, 30,40 ]
 df['b1'] = [50, 60, np.nan, np.nan ]

df

To some extent it depends on the structure of the data. Column 'a' has consecutive values, which is then partially repeated. Is this pattern likely to continue? Can you guarantee it? — njp, Aug 09 '22 at 23:04
No. I only provide a minimal example here. The column a has random and repetetive values. — White cow, Aug 09 '22 at 23:06

score 1 · Accepted Answer · answered Aug 10 '22 at 00:11

Here's a way to do what you want. First you want to group by column 'a'. Normally groupby is used to calculate group aggregation functions:

df.groupby('a')['b'].mean()

but in this case we want to keep the values of b associated with each a. You can use

[(a,list(b)) for a,b in df.groupby('a')['b']]    

[(1, [10, 50]), (2, [20, 60]), (3, [30]), (4, [40])]

Conversion of this to a dataframe almost gets us there:

df2 = pd.DataFrame([(a,list(b)) for a,b in df.groupby('a')['b']],
                   columns=['a','temp'])

   a      temp
0  1  [10, 50]
1  2  [20, 60]
2  3      [30]
3  4      [40]

The column temp can be separated into different columns with to_list:

pd.DataFrame(df2['temp'].to_list())

    0     1
0  10  50.0
1  20  60.0
2  30   NaN
3  40   NaN

Rejoin the output dataframes:

df2.join(df3)

   a      temp   0     1
0  1  [10, 50]  10  50.0
1  2  [20, 60]  20  60.0
2  3      [30]  30   NaN
3  4      [40]  40   NaN

And clean up (remove temp column, rename columns and you probably want to do something about the integers cast to floats in the last column due to the NaNs)

I'm sure there's a cleaner way to do this but hopefully this gets you started!

I assume here that the 50 goes into the second column (your `b1` column) as it is associated with the second appearance of a value (1) in column `a`. (Related to my original comment above) — njp, Aug 10 '22 at 00:14

Group by the column in df python

1 Answers1