I'm working on a dataset that has temperature values from multiple sensors at 5-minute intervals.

The requirement is to

  • calculate the mean from all sensors in each interval
  • if any values are more than 3% from the mean (above or below) then
    • drop the most extreme value (i.e. the one furthest from the mean)
    • recalculate the mean
  • repeat while any remaining value is more than 3% from the recalculated mean

This is different to other answers I've found where the entire row is dropped - I just need to successively drop the highest outlier until all values are within 3%.

I've tried a range of approaches but I'm going in circles. Help!

Normalitie
  • Please provide a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) (also look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)) that replicates your problem, including the expected output. – Timus Oct 26 '22 at 08:32

1 Answer

What you want to do is loop until your condition is met (no more values >3% from mean).

import pandas as pd

values = {"col": [90, 85, 80, 70, 95, 100]}
index_labels = ["A", "B", "C", "D", "E", "F"]
df = pd.DataFrame(values, index=index_labels)

all_within_three_percent = False  # condition is not met by default
while not all_within_three_percent:  # keep looping until the condition is met
    mean = df["col"].mean()  # (re)calculate the mean
    three_percent_deviation = abs(mean) * 0.03  # (re)calculate the threshold; abs() keeps it positive for negative means
    df["deviation"] = (df["col"] - mean).abs()  # absolute deviation per row, so values above and below the mean both count
    if (df["deviation"] > three_percent_deviation).any():  # at least one value is outside the threshold
        df = df.drop(df["deviation"].idxmax())  # drop the row furthest from the mean
    else:
        all_within_three_percent = True  # all values are within 3%: stop looping
df = df.drop(columns="deviation")  # drop the helper column

This returns:

   col
E   95
F  100
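
The loop above works on a single column. The question describes several sensors per 5-minute interval, so the same idea could be applied to each row instead. The following is only a sketch under assumptions: the layout (one row per interval, one column per sensor), the sensor names, the sample readings and the helper `mask_outliers` are made up for illustration and are not taken from the question. Dropped readings are replaced with NaN so the table keeps its shape.

```python
import numpy as np
import pandas as pd


def mask_outliers(row: pd.Series, tolerance: float = 0.03) -> pd.Series:
    """Set readings to NaN, one at a time, until the rest are within tolerance of their mean."""
    row = row.astype(float).copy()
    while row.notna().sum() > 1:
        mean = row.mean()  # NaN values are skipped by default
        deviation = (row - mean).abs()
        if not (deviation > abs(mean) * tolerance).any():
            break  # every remaining reading is within the tolerance
        row[deviation.idxmax()] = np.nan  # mask the reading furthest from the mean
    return row


# Hypothetical data: one row per 5-minute interval, one column per sensor.
readings = pd.DataFrame(
    {
        "sensor_1": [20.0, 21.0],
        "sensor_2": [20.2, 21.1],
        "sensor_3": [19.9, 25.0],
        "sensor_4": [20.1, 20.9],
    },
    index=pd.date_range("2022-10-26 08:00", periods=2, freq="5min"),
)

cleaned = readings.apply(mask_outliers, axis=1)  # run the loop on every interval
cleaned["mean"] = cleaned.mean(axis=1)  # mean of the readings that survived
print(cleaned)
```

In this sample only the 25.0 reading in the second interval is masked; the first interval is left untouched because all four readings are already within 3% of their mean.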

Note that when two values are equally far from the mean, idxmax returns the first occurrence, so that row is the one removed. After the first iteration (removing 70), the mean is 90, so both 80 and 100 deviate by 10. The loop removes 80, not 100.
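
If the higher value should go on a tie instead (the question talks about dropping the highest value), one option is to break ties explicitly. A minimal sketch, reusing the sample values from above; the helper name `drop_outliers_prefer_high` is hypothetical, and the exact equality test for ties is an assumption that works for these values but may need something like `numpy.isclose` for real float readings.

```python
import pandas as pd


def drop_outliers_prefer_high(values: pd.Series, tolerance: float = 0.03) -> pd.Series:
    """Drop the value furthest from the mean until all are within tolerance; on ties drop the higher one."""
    values = values.copy()
    while len(values) > 1:
        mean = values.mean()
        deviation = (values - mean).abs()
        if not (deviation > abs(mean) * tolerance).any():
            break  # every remaining value is within the tolerance
        tied = values[deviation == deviation.max()]  # all values sharing the largest deviation
        values = values.drop(tied.idxmax())  # among those, drop the highest value
    return values


col = pd.Series([90, 85, 80, 70, 95, 100], index=list("ABCDEF"))
result = drop_outliers_prefer_high(col)
print(result)
```

With these values the tie-break changes the outcome considerably: only C (80) remains, compared with E and F above, so it is worth deciding up front which behaviour is wanted.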

Paul