0

I make bins out of my column using pandas' pd.qcut(). I would like to, then apply smoothing by corresponding bin's mean value.

I generate my bins with something like

pd.qcut(col, 3)

For example, Given the column values [4, 8, 15, 21, 21, 24, 25, 28, 34] and the generated bins

Bin1 [4, 15]: 4, 8, 15
Bin2 [21, 24]: 21, 21, 24
Bin3 [25, 34]: 25, 28, 34

I would like to replace the values with the following means

Mean of Bin1 (4, 8, 15) = 9
Mean of Bin2 (21, 21, 24) = 22
Mean of Bin3 (25, 28, 34) = 29

Therefore:

Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29

making the final dataset: [9, 9, 9, 22, 22, 22, 29, 29, 29]

How can one also add a column with closest bin boundaries?

Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34

making the final dataset: [4, 4, 15, 21, 21, 24, 25, 25, 34]

very similar to this question which is for R

Blaine
  • 576
  • 9
  • 30

2 Answers2

2

It's exactly as you laid out. Using this technique to get nearest

df = pd.DataFrame({"col":[4, 8, 15, 21, 21, 24, 25, 28, 34]})

df2 = df.assign(bin=pd.qcut(df.col, 3),
         colbmean=lambda dfa: dfa.groupby("bin").transform("mean"),
         colbin=lambda dfa: dfa.apply(lambda r: min([r.bin.left,r.bin.right], key=lambda x: abs(x-r.col)), axis=1))

col bin colbmean colbin
0 4 (3.999, 19.0] 9 3.999
1 8 (3.999, 19.0] 9 3.999
2 15 (3.999, 19.0] 9 19
3 21 (19.0, 24.333] 22 19
4 21 (19.0, 24.333] 22 19
5 24 (19.0, 24.333] 22 24.333
6 25 (24.333, 34.0] 29 24.333
7 28 (24.333, 34.0] 29 24.333
8 34 (24.333, 34.0] 29 34
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • Any way to achieve this without adding additional columns? I intend to replace the value in column. If not, I am fine with adding and dropping the column instead. – Blaine Feb 09 '21 at 16:00
  • 1
    with both you can change the `assign()` to go straight to the column you want (overwrite). **bin** in mean could be done the the `groupby()` in the nearest, I don't see a practical way of doing it without creating the **bin** column then. deleting it – Rob Raymond Feb 09 '21 at 16:25
  • One final question, What if my column name has a space in it? How would the syntax for assign go? – Blaine Feb 09 '21 at 19:53
  • 1
    just switch to longer hand version `df["my col"]` and it will work – Rob Raymond Feb 09 '21 at 20:26
1

You'll find below the solution I came up with to answer your problem. There is still a limitation, pandas.qcut does not return closed intervals, for this matter the results are not exactly the one you described.

import pandas as pd

df = pd.DataFrame({'value': [4, 8, 15, 21, 21, 24, 25, 28, 34]})

df['bin'] = pd.qcut(df['value'], 3)
df = df.join(df.groupby('bin')['value'].mean(), on='bin', rsuffix='_average_in_bin')

df['bin_left'] = df['bin'].apply(lambda x: x.left)
df['bin_right'] = df['bin'].apply(lambda x: x.right)
df['nearest_boundary'] = df.apply(lambda x: x['bin_left'] if abs(x['value'] - x['bin_left']) < abs(x['value'] - x['bin_right']) else x['bin_right'], axis=1)
arhr
  • 1,505
  • 8
  • 16