For this data, I'd suggest pivoting the data and then taking the mean. This is usually faster because the mean runs over the entire dataframe at once instead of going through each group.
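If you want to reproduce this locally, here is a minimal setup sketch; the bin column names and the value column come from your output, but the number of rows, the underlying a/b/c columns, and the 0.05 bin width are assumptions on my part:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100  # assumed number of rows

df = pd.DataFrame({
    'a': rng.normal(size=n),
    'b': rng.normal(size=n),
    'c': rng.normal(size=n),
    'value': rng.normal(size=n),
})

# Bin each column into 0.05-wide intervals (width inferred from the output below).
edges = np.round(np.arange(-4, 4.05, 0.05), 2)
for col in ('a', 'b', 'c'):
    df[f'{col}_bins'] = pd.cut(df[col], bins=edges)
With a dataframe like that, the pivot + mean version looks like this: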
(df
 .pivot(index=None, columns=['a_bins', 'b_bins', 'c_bins'], values='value')
 .mean()
 .sort_index() # skip this if you are not fussy about order
)
a_bins b_bins c_bins
(-2.15, -2.1] (0.25, 0.3] (-1.3, -1.25] 0.929100
(0.75, 0.8] (-0.3, -0.25] 0.480411
(-2.05, -2.0] (-0.1, -0.05] (0.3, 0.35] -1.684900
(0.75, 0.8] (-0.25, -0.2] -1.184411
(-2.0, -1.95] (-0.6, -0.55] (-1.2, -1.15] -0.021176
...
(1.7, 1.75] (-0.75, -0.7] (1.05, 1.1] -0.229518
(1.85, 1.9] (-0.4, -0.35] (1.8, 1.85] 0.003017
(1.9, 1.95] (-1.45, -1.4] (0.1, 0.15] 0.949361
(2.05, 2.1] (-0.35, -0.3] (-0.65, -0.6] 0.763184
(2.25, 2.3] (-0.95, -0.9] (0.1, 0.15] 2.539432
This matches the output from the groupby:
(df
.groupby(['a_bins','b_bins','c_bins'])
.agg({'value':['mean']})
.dropna()
.squeeze()
)
a_bins b_bins c_bins
(-2.15, -2.1] (0.25, 0.3] (-1.3, -1.25] 0.929100
(0.75, 0.8] (-0.3, -0.25] 0.480411
(-2.05, -2.0] (-0.1, -0.05] (0.3, 0.35] -1.684900
(0.75, 0.8] (-0.25, -0.2] -1.184411
(-2.0, -1.95] (-0.6, -0.55] (-1.2, -1.15] -0.021176
...
(1.7, 1.75] (-0.75, -0.7] (1.05, 1.1] -0.229518
(1.85, 1.9] (-0.4, -0.35] (1.8, 1.85] 0.003017
(1.9, 1.95] (-1.45, -1.4] (0.1, 0.15] 0.949361
(2.05, 2.1] (-0.35, -0.3] (-0.65, -0.6] 0.763184
(2.25, 2.3] (-0.95, -0.9] (0.1, 0.15] 2.539432
Name: (value, mean), Length: 100, dtype: float64
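If you'd rather verify the match programmatically than by eye, a quick sketch (the pivot_mean and group_mean names are just for illustration):
import numpy as np

pivot_mean = (df
 .pivot(index=None, columns=['a_bins', 'b_bins', 'c_bins'], values='value')
 .mean()
 .sort_index()
)
group_mean = (df
 .groupby(['a_bins', 'b_bins', 'c_bins'])
 .agg({'value': ['mean']})
 .dropna()
 .squeeze()
 .sort_index()
)
# Both are sorted on the same bin index, so the values line up positionally.
assert np.allclose(pivot_mean.to_numpy(), group_mean.to_numpy())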
The pivot option runs in about 3.72 ms on my PC, while I had to terminate the groupby option because it was taking too long (my PC is quite old :))
Again, this works and is faster because the pivot spreads the values into one column per observed bin combination, so a single vectorised mean pass over the entire dataframe computes every bin mean at once, instead of going through the groups in the groupby.
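If you want to see what that single mean call operates on, inspecting the intermediate wide frame makes it concrete; each column corresponds to one observed bin combination and is NaN everywhere except in the rows that fall into it:
wide = (df
 .pivot(index=None, columns=['a_bins', 'b_bins', 'c_bins'], values='value')
)
wide.shape          # (number of original rows, number of observed bin combinations)
wide.notna().sum()  # how many rows land in each bin combination
wide.mean()         # one vectorised pass produces every bin mean at once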
As for your other question, you can index the result easily:
bin_mean = (df
 .pivot(index=None, columns=['a_bins', 'b_bins', 'c_bins'], values='value')
 .mean()
 .sort_index() # skip this if you are not fussy about order
)
bin_mean.loc[(-1.72, 0.32, 1.18)]
-0.25243603652138985
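If you expect to look up combinations that never occurred in the data, you may want to guard against the KeyError that .loc raises for a missing bin; lookup_mean below is just a hypothetical convenience wrapper:
def lookup_mean(bin_mean, a, b, c):
    # .loc resolves each scalar to the interval (bin) containing it;
    # a KeyError means no rows fell into that particular bin combination.
    try:
        return bin_mean.loc[(a, b, c)]
    except KeyError:
        return float('nan')

lookup_mean(bin_mean, -1.72, 0.32, 1.18)  # -0.25243603652138985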
The main problem, though, is that with categorical groupers Pandas returns a row for every possible category combination, including the ones that never appear in the data (which is wasteful and inefficient); pass observed=True and you should notice a dramatic improvement:
(df.groupby(['a_bins','b_bins','c_bins'], observed=True)
.agg({'value':['mean']})
)
value
mean
a_bins b_bins c_bins
(-2.15, -2.1] (0.25, 0.3] (-1.3, -1.25] 0.929100
(0.75, 0.8] (-0.3, -0.25] 0.480411
(-2.05, -2.0] (-0.1, -0.05] (0.3, 0.35] -1.684900
(0.75, 0.8] (-0.25, -0.2] -1.184411
(-2.0, -1.95] (-0.6, -0.55] (-1.2, -1.15] -0.021176
... ...
(1.7, 1.75] (-0.75, -0.7] (1.05, 1.1] -0.229518
(1.85, 1.9] (-0.4, -0.35] (1.8, 1.85] 0.003017
(1.9, 1.95] (-1.45, -1.4] (0.1, 0.15] 0.949361
(2.05, 2.1] (-0.35, -0.3] (-0.65, -0.6] 0.763184
(2.25, 2.3] (-0.95, -0.9] (0.1, 0.15] 2.539432
Speed is about 7.39 ms on my PC, roughly twice the pivot option's time, but still a dramatic improvement over the default, and that is because only the category combinations that actually exist in the dataframe are used and returned.
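Your numbers will differ, of course; if you want to benchmark the two fast variants yourself, a small timeit sketch along these lines should do (the default groupby is left out because, as noted above, it did not finish on my machine):
import timeit

pivot_t = timeit.timeit(
    lambda: (df
             .pivot(index=None, columns=['a_bins', 'b_bins', 'c_bins'], values='value')
             .mean()),
    number=100,
)
observed_t = timeit.timeit(
    lambda: (df
             .groupby(['a_bins', 'b_bins', 'c_bins'], observed=True)
             .agg({'value': ['mean']})),
    number=100,
)
print(f'pivot + mean:          {pivot_t / 100 * 1e3:.2f} ms per run')
print(f'groupby observed=True: {observed_t / 100 * 1e3:.2f} ms per run')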