How do I cut a column in two bins and aggregate data per bin using groupby?

Question

So, this is my dataframe.

session_id  question_difficulty     attempt_updated_at
5c822af21c1fba22            2   1557470128000
5c822af21c1fba22            3   1557469685000
5c822af21c1fba22            4   1557470079000
5c822af21c1fba22            5   1557472999000
5c822af21c1fba22            3   1557474145000
5c822af21c1fba22            3   1557474441000
5c822af21c1fba22            4   1557474299000
5c822af21c1fba22            4   1557474738000
5c822af21c1fba22            3   1557475430000
5c822af21c1fba22            4   1557476960000
5c822af21c1fba22            5   1557477458000
5c822af21c1fba22            2   1557478118000
5c822af21c1fba22            5   1557482556000
5c822af21c1fba22            4   1557482809000
5c822af21c1fba22            5   1557482886000
5c822af21c1fba22            5   1557484232000

I want to cut the field 'attempt_updated_at'(which is epoch time) into 2 equal bins and find mean of 'question_difficulty' in that bin per session.

I want to store the mean of 1st bin and 2nd bin separately.

I tried to go through pd.cut but I do not know how to use it.

I expect my output to be like,

for example,

session_id         mean1_difficulty       mean2_difficulty
5c822af21c1fba22            5.0                3.0

Any idea is appreciated, Thank you.

score 2 · Accepted Answer · answered Jul 09 '19 at 07:42

I believe you need qcut with aggregate mean:

df1 = (df.groupby(['session_id', pd.qcut(df['attempt_updated_at'], 2, labels=False)])
         ['question_difficulty'].mean()
                                .unstack()
                                .rename(columns=lambda x: f'mean{x+1}_difficulty'))
print (df1)
attempt_updated_at  mean1_difficulty  mean2_difficulty
session_id                                            
5c822af21c1fba22                 3.5             4.125

Or cut:

df1 = (df.groupby(['session_id', pd.cut(df['attempt_updated_at'], 2, labels=False)])
         ['question_difficulty'].mean()
                                .unstack()
                                .rename(columns=lambda x: f'mean{x+1}_difficulty'))
print (df1)
attempt_updated_at  mean1_difficulty  mean2_difficulty
session_id                                            
5c822af21c1fba22            3.444444          4.285714

Difference between functions is better explain here.

Thank you it works, but is there a way in the pd.cut method to filter out rows which will have 0 in either 'mean1_difficulty' or 'mean2_difficulty' ? — RedDragon, Jul 09 '19 at 08:09
@RedDragon - Do you think filter by [boolean indexing](https://stackoverflow.com/q/17071871) ? — jezrael, Jul 09 '19 at 08:10
@RedDragon - use `df = df[pd.cut(df['attempt_updated_at'], 2, labels=False) == 0]` — jezrael, Jul 09 '19 at 08:14

Itamar · Answer 2 · 2019-07-10T11:59:52.327

1

I think that should do it:

pdf.sort_values('attempt_updated_at', ascending=False, inplace=True).reset_index(drop=True)
first = pdf.iloc[:pdf.shape[0] // 2]
second = pdf.iloc[pdf.shape[0] // 2:]

res = pd.DataFrame(first.groupby('session_id')['question_difficulty'].agg('mean')) \
    .rename(columns={'question_difficulty': 'mean1_difficulty'}) \
    .join(second.groupby('session_id')['question_difficulty'].agg('mean')) \
    .rename(columns={'question_difficulty': 'mean2_difficulty'})

edited Jul 10 '19 at 11:59

answered Jul 09 '19 at 08:15

Itamar

129
1
4

**first = pdf.iloc[:pdf.shape[0] // 2] second = pdf.iloc[pdf.shape[0] // 2:]** This will divide the dataframe but i want to divide it based on the epoch time. – RedDragon Jul 09 '19 at 08:45
I misunderstood, only followed the need to cut it "into 2 equal bins".... anyhows, sorting the data-frame by the epoch column would still be a winner. – Itamar Jul 10 '19 at 11:51
1

There, I have edited the code to support the cutting of data by the epoch time value (the "ascending=True" was just to make sure the index is indeed reset correctly, you can drop it. – Itamar Jul 10 '19 at 12:01

How do I cut a column in two bins and aggregate data per bin using groupby?

2 Answers2