Here's a replicable example of my data. Suppose I have the following pandas.DataFrame
stored in df
:
# Import libraries
import numpy as np
import pandas as pd
from scipy import stats
# 1K observations from two random variables
X1 = stats.norm(loc=20000, scale=5000).rvs(1000, random_state=123)
X2 = stats.norm(loc=1500, scale=300).rvs(1000, random_state=123)
df = pd.DataFrame({'id':range(1000), 'v1':X1, 'v2':X2})
# Assign deciles according to `v1`
df['v1_bin'] = pd.qcut(x=df['v1'], q=10, labels=False)
print(df.head())
id v1 v2 v1_bin
0 0 14571.846983 1174.310819 1
1 1 24986.727233 1799.203634 8
2 2 21414.892490 1584.893549 6
3 3 12468.526430 1048.111586 0
4 4 17106.998740 1326.419924 2
I now want to group by v1_bin
and split each group into its own deciles based on v2
.
I tried the following:
### Attempt
# Group by 'v1_bin' and assign deciles according to `v2`
df['v2_bin'] = df.groupby('v1_bin')['v2'].quantile(np.arange(0.1, 1.1, 0.1))
v1_bin
0 0.1 772.493815
0.2 867.479244
0.3 922.716877
0.4 960.185441
0.5 984.500809
...
However, this last line returns 100 values which represent the upper limits of each decile. I want the labels of each observation. That is, 1000 values, each with their own label.
In other words, I want my final table to look like this (note that the labels in v2_bin
are made up):
id v1 v2 v1_bin v2_bin
0 0 14571.846983 1174.310819 1 1
1 1 24986.727233 1799.203634 8 1
2 2 21414.892490 1584.893549 6 0
3 3 12468.526430 1048.111586 0 1
4 4 17106.998740 1326.419924 2 4
How can this be done?