0

my input pandas dataframe would be the following:

        timestamp  value       value_diff
0           1       100            NaN
1           2       110           10.0
2           3       112            2.0
3           4       130           18.0
4           5      8100         7970.0
5           6      8102            2.0
6           7      7998         -104.0
7           8      5000        -2998.0
8           9       140        -4860.0
9          10       180           40.0
10         11       200           20.0
11         12       205            5.0
12         13       230           25.0
13         14      8200         7970.0
14         15      3000        -5200.0
15         16       235        -2765.0
16         17       247           12.0
17         18       255            8.0
18         19       280           25.0
19         20       302           22.0
20         21       303            1.0
21         22       315           12.0
22         23       330           15.0
23         24       350           20.0
24         25       360           10.0

What I wanted to do is to remove all the data which it's difference is greater than 200 (in absolute value) and the data in between.

Therefore, the output would be something like this:

        timestamp  value       value_diff
0           1       100            NaN
1           2       110           10.0
2           3       112            2.0
3           4       130           18.0
9          10       180           40.0
10         11       200           20.0
11         12       205            5.0
12         13       230           25.0
16         17       247           12.0
17         18       255            8.0
18         19       280           25.0
19         20       302           22.0
20         21       303            1.0
21         22       315           12.0
22         23       330           15.0
23         24       350           20.0
24         25       360           10.0

Thank you very much in advance!

FranG91
  • 83
  • 9

3 Answers3

1

Maybe you may be able to use the IQR statistical method to filter out your outliers:

q1, q3 = df['value'].quantile([0.25, 0.75])
iqr = q3 - q1
lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr

out = df[df['value'].between(lb, ub)]

Output:

>>> out
    timestamp  value
0           1    100
1           2    110
2           3    112
3           4    130
8           9    140
9          10    180
10         11    200
11         12    205
12         13    230
15         16    235
16         17    247
17         18    255
18         19    280
19         20    302
20         21    303
21         22    315
22         23    330
23         24    350
24         25    360
Corralien
  • 109,409
  • 8
  • 28
  • 52
0

I am not sure which one are you expect.

  1. Use first value as baseline, and update baseline in loop.
  2. Use medium as baseline, in case first value is spike too.
# value = df['value'].iloc[0] # Use first value as baseline
value = df['value'].quantile(0.5) # Use medium as baseline
rows = []
for i, row in df.iterrows():
    if abs(row['value'] - value) <= 200:
        rows.append(row)
        value = row['value']
new_df = pd.DataFrame(rows)

output:

    timestamp  value
0           1    100
1           2    110
2           3    112
3           4    130
8           9    140
9          10    180
10         11    200
11         12    205
12         13    230
15         16    235
16         17    247
17         18    255
18         19    280
19         20    302
20         21    303
21         22    315
22         23    330
23         24    350
24         25    360
Shuo
  • 1,512
  • 1
  • 3
  • 13
  • 1
    The logic is a good idea, but using `iterrows` is a terrible implementation. Please update with vectorial code (then I'd be happy to upvote) – mozway May 05 '23 at 09:19
0

you can do this with Loc

df.loc[df.value_diff.abs()<200]
Jamie
  • 146
  • 7