
I would like to change the value of certain DataFrame entries only if a certain condition is met n consecutive times.

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8, 0] = 40
df.iloc[12, 0] = -40
df.iloc[10:12, 1] = -40

Which gives me this DF:

            0          1         2
0    1.238892   0.802318 -0.013856
1   -1.136326  -0.527263 -0.260975
2    1.118771   0.031517  0.527350
3    1.629482  -0.158941 -1.045209
4   40.000000   0.598657 -1.268399
5   40.000000   0.442297 -0.016363
6   40.000000  -0.316817  1.744822
7   40.000000   0.193083  0.914172
8    0.322756  -0.680682  0.888702
9   -1.204531  -0.240042  1.416020
10  -1.337494 -40.000000 -1.195780
11  -0.703669 -40.000000  0.657519
12 -40.000000  -0.288235 -0.840145
13  -1.084869  -0.298030 -1.592004
14  -0.617568  -1.046210 -0.531523

Now, if I do

a=df.copy()
a[ abs(a) > abs(a.std()) ] = float('nan')

I get

           0         1         2
0   1.238892  0.802318 -0.013856
1  -1.136326 -0.527263 -0.260975
2   1.118771  0.031517  0.527350
3   1.629482 -0.158941       NaN
4        NaN  0.598657       NaN
5        NaN  0.442297 -0.016363
6        NaN -0.316817       NaN
7        NaN  0.193083  0.914172
8   0.322756 -0.680682  0.888702
9  -1.204531 -0.240042       NaN
10 -1.337494       NaN       NaN
11 -0.703669       NaN  0.657519
12       NaN -0.288235 -0.840145
13 -1.084869 -0.298030       NaN
14 -0.617568 -1.046210 -0.531523

which is fair. However, I would like to replace values with NaN only where the condition is met by at most 2 consecutive entries (so I can interpolate later). For example, I want the result to be

            0          1         2
0    1.238892   0.802318 -0.013856
1   -1.136326  -0.527263 -0.260975
2    1.118771   0.031517  0.527350
3    1.629482  -0.158941       NaN
4   40.000000   0.598657       NaN
5   40.000000   0.442297 -0.016363
6   40.000000  -0.316817       NaN
7   40.000000   0.193083  0.914172
8    0.322756  -0.680682  0.888702
9   -1.204531  -0.240042       NaN
10  -1.337494        NaN       NaN
11  -0.703669        NaN  0.657519
12        NaN  -0.288235 -0.840145
13  -1.084869  -0.298030       NaN
14  -0.617568  -1.046210 -0.531523

Apparently there's no ready-to-use method for this. The closest solution I found to my problem was this one, but I couldn't make it work for me.

Any ideas?

TomCho

1 Answer


See below - the tricky part is (cond[c] != cond[c].shift(1)).cumsum() which breaks the data into contiguous runs of the same value.

In [23]: cond = abs(df) > abs(df.std())

In [24]: for c in df.columns:
    ...:     grouper = (cond[c] != cond[c].shift(1)).cumsum() * cond[c]
    ...:     fill = (df.groupby(grouper)[c].transform('size') <= 2)
    ...:     df.loc[fill & cond[c], c] = np.nan  # & cond[c] keeps group 0 (values that never met the condition) intact

In [25]: df
Out[25]: 
            0         1         2
0    1.238892  0.802318 -0.013856
1   -1.136326 -0.527263 -0.260975
2    1.118771  0.031517  0.527350
3    1.629482 -0.158941       NaN
4   40.000000  0.598657       NaN
5   40.000000  0.442297 -0.016363
6   40.000000 -0.316817       NaN
7   40.000000  0.193083  0.914172
8    0.322756 -0.680682  0.888702
9   -1.204531 -0.240042       NaN
10  -1.337494       NaN       NaN
11  -0.703669       NaN  0.657519
12        NaN -0.288235 -0.840145
13  -1.084869 -0.298030       NaN
14  -0.617568 -1.046210 -0.531523
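Since the question's stated goal is to interpolate the short gaps afterward, a possible follow-up (a sketch, not part of the original answer; the single-column frame below is made up for illustration) is pandas' built-in interpolation:

```python
import numpy as np
import pandas as pd

# Hypothetical column where a short run has already been masked to NaN
df = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0, 40.0, 40.0, 40.0]})

# limit=2 caps the number of consecutive NaNs filled, matching the
# "at most 2 consecutive entries" rule from the question
filled = df.interpolate(limit=2)
print(filled['a'].tolist())   # [1.0, 2.0, 3.0, 4.0, 40.0, 40.0, 40.0]
```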

To explain a bit more, cond[c] is a boolean series indicating whether your condition is true or not.

The cond[c] != cond[c].shift(1) compares each row's condition to the previous row's. This has the effect of 'marking' where a run of values begins with the value True.

The .cumsum() converts the bools to integers and takes the cumulative sum. It may not be immediately intuitive, but this 'numbers' the groups of contiguous values. Finally, the * cond[c] reassigns all groups that didn't meet the criteria to 0 (using the fact that False == 0).

So now you have groups of contiguous numbers that meet your condition. The next step performs a groupby to count how many values are in each group (transform('size')).

Finally, a new boolean condition is used to assign missing values to those groups with 2 or fewer values meeting the condition.
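The steps above can be traced on a small toy Series (the values and the threshold of 10 are made up for illustration; the real code uses abs(df.std()) instead):

```python
import pandas as pd

s = pd.Series([1.0, 40.0, 40.0, 1.0, 40.0, 1.0])
cond = s > 10                 # [False, True, True, False, True, False]

# Mark run starts, then number the contiguous runs cumulatively
runs = (cond != cond.shift(1)).cumsum()
print(runs.tolist())          # [1, 2, 2, 3, 4, 5]

# Collapse all runs that fail the condition into group 0
grouper = runs * cond
print(grouper.tolist())       # [0, 2, 2, 0, 4, 0]

# Broadcast each group's size back onto its rows
sizes = s.groupby(grouper).transform('size')
print(sizes.tolist())         # [3, 2, 2, 3, 1, 3]
```

Rows where sizes <= 2 and cond is True belong to short runs, which is exactly what the loop sets to NaN.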

chrisb
  • I haven't really understood how this works. Would it be too hard to elaborate? It works really well, though. I'm right now trying to optimize, but it's kind of hard when you don't really grasp what it's doing. – TomCho Sep 30 '15 at 13:59
  • I added a bit more language - the most helpful way to get it is probably to go through each step in a REPL and see the intermediate values. – chrisb Sep 30 '15 at 14:48
  • Great, I think I got it. The first line of the loop was the tricky one! +1 – TomCho Sep 30 '15 at 16:12
  • Hey, just figured you actually need `df.loc[fill & cond[c], c] = np.nan` in the last line or it doesn't work for small dataframes. – TomCho Apr 24 '16 at 18:39