Group by one column to assign quantiles based on another column

Question

Here's a replicable example of my data. Suppose I have the following pandas.DataFrame stored in df:

# Import libraries
import numpy as np
import pandas as pd
from scipy import stats

# 1K observations from two random variables
X1 = stats.norm(loc=20000, scale=5000).rvs(1000, random_state=123)
X2 = stats.norm(loc=1500, scale=300).rvs(1000, random_state=123)
df = pd.DataFrame({'id':range(1000), 'v1':X1, 'v2':X2})

# Assign deciles according to `v1`
df['v1_bin'] = pd.qcut(x=df['v1'], q=10, labels=False)
print(df.head())

   id            v1           v2  v1_bin
0   0  14571.846983  1174.310819       1
1   1  24986.727233  1799.203634       8
2   2  21414.892490  1584.893549       6
3   3  12468.526430  1048.111586       0
4   4  17106.998740  1326.419924       2

I now want to group by v1_bin and split each group into its own deciles based on v2.

I tried the following:

### Attempt
# Group by 'v1_bin' and assign deciles according to `v2`
df['v2_bin'] = df.groupby('v1_bin')['v2'].quantile(np.arange(0.1, 1.1, 0.1))

v1_bin     
0       0.1     772.493815
        0.2     867.479244
        0.3     922.716877
        0.4     960.185441
        0.5     984.500809
                  ...

However, this last line returns 100 values which represent the upper limits of each decile. I want the labels of each observation. That is, 1000 values, each with their own label.

In other words, I want my final table to look like this (note that the labels in v2_bin are made up):

   id            v1           v2  v1_bin  v2_bin
0   0  14571.846983  1174.310819       1       1
1   1  24986.727233  1799.203634       8       1
2   2  21414.892490  1584.893549       6       0
3   3  12468.526430  1048.111586       0       1
4   4  17106.998740  1326.419924       2       4

How can this be done?

I voted to close my own question. There's another thread [here](https://stackoverflow.com/questions/19403133/pandas-groupby-and-qcut) and I did not realize. — Arturo Sbr, Jun 11 '21 at 19:23

score 2 · Accepted Answer · answered Jun 11 '21 at 18:59

2

Try qcut again with groupby transform:

df['v2_bin'] = (
    df.groupby('v1_bin')['v2']
        .transform(lambda g: pd.qcut(g, q=10, labels=False))
)

df.head():

   id            v1           v2  v1_bin  v2_bin
0   0  14571.846983  1174.310819       1       5
1   1  24986.727233  1799.203634       8       4
2   2  21414.892490  1584.893549       6       3
3   3  12468.526430  1048.111586       0       7
4   4  17106.998740  1326.419924       2       8

answered Jun 11 '21 at 18:59

Henry Ecker

34,399
18
41
57

Thanks! Could you briefly explain what `transform` does and why it's necessary for the lambda? – Arturo Sbr Jun 11 '21 at 19:07
2

So actually `transform` and `apply` both work here. So essentially we're just performing `qcut` on each group. I went with `transform` for the idiomatic reason that is that we're _transforming_ each group of 'v2' into quantiles. There are also benefits to the way that the data is passed to `transform` vs `apply`. But there's a ton of good information here: [Apply vs transform on a group object](https://stackoverflow.com/q/27517425/15497888) that details the differences between the two possible options. – Henry Ecker Jun 11 '21 at 19:11

score 1 · Answer 2 · answered Jun 11 '21 at 19:25

Here a more handcrafted version:

new_bins = [] 
for i in df["v1_bin"].unique(): 
    new_bins.append(pd.qcut(x=df[df["v1_bin"] == i]["v2"], q=10, labels=False)) 
 
df["v2_bin"]=pd.concat(new_bins, axis=0).sort_index() 
df  

          id            v1           v2  v1_bin  v2_bin
0      0  14571.846983  1174.310819       1       5
1      1  24986.727233  1799.203634       8       4
2      2  21414.892490  1584.893549       6       3
3      3  12468.526430  1048.111586       0       7
4      4  17106.998740  1326.419924       2       8
..   ...           ...          ...     ...     ...
995  995  23173.815658  1690.428939       7       4
996  996  25349.592995  1820.975580       8       6
997  997  15453.364917  1227.201895       1       9
998  998  22351.318738  1641.079124       7       0
999  999  14442.847761  1166.570866       1       4

Group by one column to assign quantiles based on another column

2 Answers2