2

So I have a dataframe like this

df = pd.DataFrame({
'A': [1,1,2,2,3,3,3],
'B': [1,3,1,3,1,2,1],
'C': [1,3,5,3,7,7,1]})

    A   B   C
0   1   1   1
1   1   3   3
2   2   1   5
3   2   3   3
4   3   1   7
5   3   2   7
6   3   1   1

I want to create a binning of column B (count) with groupby of column A for example B_bin1 where B < 3 and B_bin2 is the rest (>=3), C_bin1 for C < 5 and C_bin2 for the rest

From that example the output I want is like this

    A   B_bin1  B_bin2  C_bin1  C_bin2
0   1   1       1       2       0
1   2   1       1       1       1
2   3   3       0       1       2

I found similar question Pandas groupby with bin counts , it is working for 1 bin

bins = [0,2,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins)])
temp_df.size().unstack()
B   (0, 2]  (2, 10]
A       
1   1       1
2   1       1
3   3       0

but when I tried using more than 1 bin, it is not working (my real data has a lot of binning groups)

bins = [0,2,10]
bins2 = [0,4,10]
temp_df=df.groupby(['A', pd.cut(df['B'], bins), pd.cut(df['C'], bins2)])
temp_df.size().unstack()
        C   (0, 4]  (4, 10]
A       B       
1   (0, 2]  1       0
    (2, 10] 1       0
2   (0, 2]  0       1
    (2, 10] 1       0
3   (0, 2]  1       2
    (2, 10] 0       0

My workaround is by create small temporary df and then binning them using 1 group 1 by 1 and then merge them in the end

I also still trying using aggregation (probably using pd.NamedAgg too) similar to this, but I wonder if that can works

df.groupby('A').agg(
    b_count = ('B', 'count'),
    b_sum = ('B', 'sum')
    c_count = ('C', 'count'),
    c_sum = ('C', 'sum')
)

Is anyone has another idea for this?

d_frEak
  • 440
  • 3
  • 12

2 Answers2

2

Because you need processing each bin separately instead groupby+size+unstack is used crosstab with join DataFrames by concat:

bins = [0,2,10]
bins2 = [0,4,10]

temp_df1=pd.crosstab(df['A'], pd.cut(df['B'], bins, labels=False)).add_prefix('B_')
temp_df2=pd.crosstab(df['A'], pd.cut(df['C'], bins2, labels=False)).add_prefix('C_')

df = pd.concat([temp_df1, temp_df2], axis=1).reset_index()
print (df)
   A  B_0  B_1  C_0  C_1
0  1    1    1    2    0
1  2    1    1    1    1
2  3    3    0    1    2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

One option, is with get_dummies, before the aggregation; this works since you have a limited bin (I'm skipping the bin and using comparison):

temp = (df
        .assign(B = df.B.lt(3), C = df.C.lt(5))
        .replace({True:1, False:2})
        )
(pd
.get_dummies(temp, columns = ['B','C'], prefix_sep='_bin')
.groupby('A')
.sum()
)
   B_bin1  B_bin2  C_bin1  C_bin2
A
1       1       1       2       0
2       1       1       1       1
3       3       0       1       2

You could use the bins, along with pd.factorize and get_dummies:

temp = df.copy()
temp['B'] = pd.cut(df.B, bins)
temp['B'] = pd.factorize(temp.B)[0] + 1
temp['C'] = pd.cut(df.C, bins2)
temp['C'] = pd.factorize(temp.C)[0] + 1

(pd
.get_dummies(temp, columns = ['B','C'], prefix_sep='_bin')
.groupby('A')
.sum()
)

   B_bin1  B_bin2  C_bin1  C_bin2
A
1       1       1       2       0
2       1       1       1       1
3       3       0       1       2
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • 1
    thank you for another idea. Yeah for the 1st one it works when I have 2 bins for each group, I plan to do the loop for the whole bin groups that I have and not all of them only have 2 bins. But I like your 2nd solution by using cut, factorize, then get_dummies trick, and it can be used for any number of bins. Thank you I learn new pd function that I never used before – d_frEak Jul 11 '22 at 01:08