Replacing data in column with mean value of corresponding bin?

Question

I make bins out of my column using pandas' pd.qcut(). I would like to, then apply smoothing by corresponding bin's mean value.

I generate my bins with something like

pd.qcut(col, 3)

For example, Given the column values [4, 8, 15, 21, 21, 24, 25, 28, 34] and the generated bins

Bin1 [4, 15]: 4, 8, 15
Bin2 [21, 24]: 21, 21, 24
Bin3 [25, 34]: 25, 28, 34

I would like to replace the values with the following means

Mean of Bin1 (4, 8, 15) = 9
Mean of Bin2 (21, 21, 24) = 22
Mean of Bin3 (25, 28, 34) = 29

Therefore:

Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29

making the final dataset: [9, 9, 9, 22, 22, 22, 29, 29, 29]

How can one also add a column with closest bin boundaries?

Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34

making the final dataset: [4, 4, 15, 21, 21, 24, 25, 25, 34]

very similar to this question which is for R

score 2 · Accepted Answer · answered Feb 09 '21 at 11:25

2

It's exactly as you laid out. Using this technique to get nearest

df = pd.DataFrame({"col":[4, 8, 15, 21, 21, 24, 25, 28, 34]})

df2 = df.assign(bin=pd.qcut(df.col, 3),
         colbmean=lambda dfa: dfa.groupby("bin").transform("mean"),
         colbin=lambda dfa: dfa.apply(lambda r: min([r.bin.left,r.bin.right], key=lambda x: abs(x-r.col)), axis=1))

	col	bin	colbmean	colbin
0	4	(3.999, 19.0]	9	3.999
1	8	(3.999, 19.0]	9	3.999
2	15	(3.999, 19.0]	9	19
3	21	(19.0, 24.333]	22	19
4	21	(19.0, 24.333]	22	19
5	24	(19.0, 24.333]	22	24.333
6	25	(24.333, 34.0]	29	24.333
7	28	(24.333, 34.0]	29	24.333
8	34	(24.333, 34.0]	29	34

answered Feb 09 '21 at 11:25

Rob Raymond

29,118
3
14
30

Any way to achieve this without adding additional columns? I intend to replace the value in column. If not, I am fine with adding and dropping the column instead. – Blaine Feb 09 '21 at 16:00
1

with both you can change the `assign()` to go straight to the column you want (overwrite). **bin** in mean could be done the the `groupby()` in the nearest, I don't see a practical way of doing it without creating the **bin** column then. deleting it – Rob Raymond Feb 09 '21 at 16:25
One final question, What if my column name has a space in it? How would the syntax for assign go? – Blaine Feb 09 '21 at 19:53
1

just switch to longer hand version `df["my col"]` and it will work – Rob Raymond Feb 09 '21 at 20:26

score 1 · Answer 2 · answered Feb 09 '21 at 11:18

You'll find below the solution I came up with to answer your problem. There is still a limitation, pandas.qcut does not return closed intervals, for this matter the results are not exactly the one you described.

import pandas as pd

df = pd.DataFrame({'value': [4, 8, 15, 21, 21, 24, 25, 28, 34]})

df['bin'] = pd.qcut(df['value'], 3)
df = df.join(df.groupby('bin')['value'].mean(), on='bin', rsuffix='_average_in_bin')

df['bin_left'] = df['bin'].apply(lambda x: x.left)
df['bin_right'] = df['bin'].apply(lambda x: x.right)
df['nearest_boundary'] = df.apply(lambda x: x['bin_left'] if abs(x['value'] - x['bin_left']) < abs(x['value'] - x['bin_right']) else x['bin_right'], axis=1)

Replacing data in column with mean value of corresponding bin?

2 Answers2