Delete values occurring between large percentage shifts

Question

I have a timeseries df comprised of daily Interest Rates points in column A ( IR aka Interest Rate) and the relative change from one day to the next in column B ( Rel Shift).

DF looks something like the below:

                   IR      Shift
May/30/2019        5.9%    1.67% 
May/29/2019        6%      1.69%      
**May/28/2019      5.9%   -292%**
May/27/2019        20.2%  -1.4%
May/26/2019        20.5%   2.5% 
**May/25/2019      20%     292%** 
May/24/2019        5.1%    -

My df shape is 4000x 2 and these large percentage changes are persistent throughout the df. In other words, the rates spike up, level off - and then at some point spike down again..or vice versa. So in the above IR changed 292%, leveled off and then spiked down 292%

Goal is to convert values lying between spikes above >50% to np.nan, inclusive of the date the spike occurred - effectively deleting those data points so that I can impute the them with more sensible data - in the 5%-6% range....

So the DF should look something like this:

                    IR      Shift
May/30/2019        5.9%    1.67% 
May/29/2019        6%      1.69%      
May/28/2019        np.nan
May/27/2019        np.nan
May/26/2019        np.nan
May/25/2019        np.nan    
May/24/2019        5.1%    -

This is what i have so far for the loop --

for i in df1:
    if df1['col_B'][i] > .50:
        df1['col_A'][i] = np.nan

score 1 · Accepted Answer · answered Sep 27 '19 at 18:07

In order to have some not cleared area also after the "clear range", I extended your DataFrame by one row, so it contains:

           Day     IR  Shift
0  May/30/2019   5.9%  1.67%
1  May/29/2019     6%  1.69%
2  May/28/2019   5.9%  -292%
3  May/27/2019  20.2%  -1.4%
4  May/26/2019  20.5%   2.5%
5  May/25/2019    20%   292%
6  May/24/2019   5.1%     2%
7  May/23/2019   5.0%      -

And now how to solve the issue:

First define a function detecting start and end of the "clear range", based on 2 auxiliary columns, which will be created soon:

def detect(row):
    if row.Shft1 < -50:
        detect.retVal = True
    elif row.Shft2 > 50:
        detect.retVal = False
    return detect.retVal

Then compute these 2 auxiliary columns:

df['Shft1'] = df.Shift.apply(lambda x: 0.0 if x == '-' else float(x.rstrip('%')))
df['Shft2'] = df.Shft1.shift(fill_value=0)

The actual computation involves the following code:

detect.retVal=False
df.IR.mask(df.apply(detect, axis=1), np.nan, inplace=True)

Reading of the above code start from df.apply(detect, axis=1). This instruction computes the mask, indicating rows where IR value should be cleared (replaced with NaN).

Then move on to mask function itself. It is applied to df.IR column, using just computed mask, thus clearing the indicated rows.

And the last step is to drop both auxiliary columns:

df.drop(columns=['Shft1', 'Shft2'], inplace=True)

The result is:

           Day    IR  Shift
0  May/30/2019  5.9%  1.67%
1  May/29/2019    6%  1.69%
2  May/28/2019   NaN  -292%
3  May/27/2019   NaN  -1.4%
4  May/26/2019   NaN   2.5%
5  May/25/2019   NaN   292%
6  May/24/2019  5.1%     2%
7  May/23/2019  5.0%      -

Great thank you Vaildi. One thing, however. Seem to be getting an error when executing df.IR.mask(df.apply(detect, axis=1), np.nan, inplace=True). Error Indicating : df1.IR.mask(df1.apply(detect, axis=1), np.nan, inplace=True), pointing to the column name..... — Tyler, Sep 27 '19 at 20:06
ah, ok fixed it via df.['IR'].mask..but unfortunately it did not replace the values in between the large movements... — Tyler, Sep 27 '19 at 20:38
Also, I should add that none of the values are in '%' form. I mention this as i noticed that you used the rstrip function for %....not sure if is the reason for the code not working.... — Tyler, Sep 27 '19 at 20:41
Probably it is enough to delete *rstrip('%')*. But note that one cell contains "-", so this column will still have *object* type (a mixture of strings and numbers). And as far as the error with *df.IR* is concerned, maybe you use some older version of *Pandas*, which does not allow *attribute access* to a column. — Valdi_Bo, Sep 27 '19 at 21:07
oh, sorry, the ''--'' was just for the sake of clarifying the issue at hand... the actual data does not contain any strings.... — Tyler, Sep 27 '19 at 21:10
So it seems that *Shft1* column is not needed and you can use just *Shift* column instead (but do not drop it at the end). — Valdi_Bo, Sep 27 '19 at 21:12

Delete values occurring between large percentage shifts

1 Answers1