2

I have grouped a dataframe and I would like, per group, to replace the values in some columns if they are larger or smaller than some percentile. So all values within a group that are larger than the 0.95 percentile should be replaced by the 0.95 percentile, and all values that are smaller than the 0.05 percentile should be replaced by the 0.05 percentile.
The dataframe could look like this (example taken from another question), with two groups, ‘one’ and ‘two’:

    A           B           C
0   0.719391    0.091693    one
1   0.951499    0.83716     one
2   0.975212    0.224855    one
3   0.80762     0.031284    one
4   0.63319     0.342889    one
5   0.075102    0.899291    one
6   0.502843    0.773424    one
7   0.032285    0.242476    one
8   0.794938    0.607745    one
9   0.620387    0.574222    one
10  0.446639    0.549749    two
11  0.664324    0.134041    two
12  0.622217    0.505057    two
13  0.670338    0.99087     two
14  0.281431    0.016245    two
15  0.675756    0.185967    two
16  0.145147    0.045686    two
17  0.404413    0.191482    two
18  0.94913     0.943509    two
19  0.164642    0.157013    two

The result for this dataframe should be:

    A           B           C
0   0.719391    0.091693    one
1   0.951499    0.83716     one
2   0.96454115  0.224855    one
3   0.80762     0.05846805  one
4   0.63319     0.342889    one
5   0.075102    0.87133205  one
6   0.502843    0.773424    one
7   0.05155265  0.242476    one
8   0.794938    0.607745    one
9   0.620387    0.574222    one
10  0.446639    0.549749    two
11  0.664324    0.134041    two
12  0.622217    0.505057    two
13  0.670338    0.96955755  two
14  0.281431    0.02949345  two
15  0.675756    0.185967    two
16  0.15391975  0.045686    two
17  0.404413    0.191482    two
18  0.8261117   0.943509    two
19  0.164642    0.157013    two

Notice that for column A rows 2, 7, 16 and 18 have been replaced; and for column B rows 3, 5, 13, and 14 have been replaced.
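For reproducibility, here is a snippet that rebuilds the example frame (values copied from the first table):

import pandas as pd

df = pd.DataFrame({
    'A': [0.719391, 0.951499, 0.975212, 0.80762, 0.63319,
          0.075102, 0.502843, 0.032285, 0.794938, 0.620387,
          0.446639, 0.664324, 0.622217, 0.670338, 0.281431,
          0.675756, 0.145147, 0.404413, 0.94913, 0.164642],
    'B': [0.091693, 0.83716, 0.224855, 0.031284, 0.342889,
          0.899291, 0.773424, 0.242476, 0.607745, 0.574222,
          0.549749, 0.134041, 0.505057, 0.99087, 0.016245,
          0.185967, 0.045686, 0.191482, 0.943509, 0.157013],
    'C': ['one'] * 10 + ['two'] * 10,
})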

Does anyone know how to do this in an efficient way for a large dataframe?

Thanks

cs95
Jamie Lee

3 Answers

4

You could use groupby + quantile + clip:

# per group in C, clip each numeric column to that group's own 5th/95th percentiles;
# x.quantile([0.05, 0.95]) returns both bounds and * unpacks them into clip(lower, upper)
g = df.groupby('C').transform(lambda x: x.clip(*x.quantile([0.05, 0.95])))
g['C'] = df['C']  # transform drops the grouping column, so add it back

           A         B    C
0   0.719391  0.091693  one
1   0.951499  0.837160  one
2   0.964541  0.224855  one
3   0.807620  0.058468  one
4   0.633190  0.342889  one
5   0.075102  0.871332  one
6   0.502843  0.773424  one
7   0.051553  0.242476  one
8   0.794938  0.607745  one
9   0.620387  0.574222  one
10  0.446639  0.549749  two
11  0.664324  0.134041  two
12  0.622217  0.505057  two
13  0.670338  0.969558  two
14  0.281431  0.029493  two
15  0.675756  0.185967  two
16  0.153920  0.045686  two
17  0.404413  0.191482  two
18  0.826112  0.943509  two
19  0.164642  0.157013  two

Sanity Check

np.allclose(e[['A', 'B']].values, g[['A', 'B']].values)
True

Here, e is the output in your question.
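If only some columns need clipping, one possible variant (here, cols is a hypothetical list of the columns to transform) is:

cols = ['A']  # hypothetical: transform only this subset of columns
df[cols] = df.groupby('C')[cols].transform(lambda x: x.clip(*x.quantile([0.05, 0.95])))

This writes the clipped values back into the original frame and skips columns that don't need the treatment.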

cs95
  • LOL, still using pandas' clip is better ~ ;-) – BENY Nov 08 '17 at 18:56
  • I must say it's taking much more time than I expected: my df has more than 4 million rows and it has been more than 20 mins since I started my run. Would it be faster if I selected only one column to transform? If yes, what is the most efficient way to do this? Thanks – Jamie Lee Nov 08 '17 at 20:52
  • @JamieLee Can't say... perhaps it is the sheer size... groupby operations are generally slow. There's an alternative called `dask` which is used for fast processing of large pandas dataframes; if you're interested, you can look into it. – cs95 Nov 08 '17 at 20:54
  • My run gives me the error "length mismatch: expected axis has ..". Is there a workaround for this? – Jamie Lee Nov 08 '17 at 21:22
  • @JamieLee Can you open a new Q with a replica of the data that produces this error? – cs95 Nov 08 '17 at 22:44
  • @cs95 what is the purpose of the * inside x.clip()? – gplt Aug 21 '22 at 00:37
2
df.groupby('C')[['A', 'B']].transform(lambda x: np.clip(x, x.quantile(0.05), x.quantile(0.95)))
Out[1599]: 
           A         B
0   0.719391  0.091693
1   0.951499  0.837160
2   0.964541  0.224855
3   0.807620  0.058468
4   0.633190  0.342889
5   0.075102  0.871332
6   0.502843  0.773424
7   0.051553  0.242476
8   0.794938  0.607745
9   0.620387  0.574222
10  0.446639  0.549749
11  0.664324  0.134041
12  0.622217  0.505057
13  0.670338  0.969558
14  0.281431  0.029493
15  0.675756  0.185967
16  0.153920  0.045686
17  0.404413  0.191482
18  0.826112  0.943509
19  0.164642  0.157013
BENY
1

To make it faster you could use this method, though it involves more lines of code if you have lots of columns. It is a very fast method (<1 s) on my dataset of 2 million rows; like yours, the .groupby approach is very slow! (A loop version for several columns is sketched after the snippet.)

A_05 = df['A'].quantile(0.05)  # 5th percentile of column A
A_95 = df['A'].quantile(0.95)  # 95th percentile of column A
df['A'] = df['A'].clip(A_05, A_95)  # assign back rather than clipping in place
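If several columns need the same treatment, a short loop keeps the line count down (a sketch following the same whole-column approach, with the column list chosen for this example):

for col in ['A', 'B']:  # columns to clip; adjust as needed
    lo, hi = df[col].quantile(0.05), df[col].quantile(0.95)
    df[col] = df[col].clip(lo, hi)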
AshU