3

I have a dataframe 16k records and multiple groups of countries and other fields. I have produced an initial output of the a data that looks like the snipit below. Now i need to do some data cleansing, manipulating, remove skews or outliers and replace it with a value based on certain rules.

i.e. on the below how could i identify the skewed points (any value greater than 1) and replace them with the average of the next two records or previous record if there no later records.(in that group)

So in the dataframe below I would like to replace Bill%4 for IT week1 of 1.21 with the average of week2 and week3 for IT so it is 0.81.

any tricks for this?

Country Week    Bill%1  Bill%2  Bill%3  Bill%4  Bill%5  Bill%6
IT     week1    0.94    0.88    0.85    1.21    0.77    0.75
IT     week2    0.93    0.88    1.25    0.80    0.77    0.72
IT     week3    0.94    1.33    0.85    0.82    0.76    0.76
IT     week4    1.39    0.89    0.86    0.80    0.80    0.76
FR     week1    0.92    0.86    0.82    1.18    0.75    0.73
FR     week2    0.91    0.86    1.22    0.78    0.75    0.71
FR     week3    0.92    1.29    0.83    0.80    0.75    0.75
FR     week4    1.35    0.87    0.84    0.78    0.78    0.74
IcemanBerlin
  • 3,239
  • 6
  • 29
  • 33

1 Answers1

3

I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.index = list('abcdeflght')

# Define cutoff value
cutoff = 0.90

for col in df.columns: 
    # Identify index locations above cutoff
    outliers = df[col][ df[col]>cutoff ]

    # Browse through outliers and average according to index location
    for idx in outliers.index:
        # Get index location 
        loc = df.index.get_loc(idx)

        # If not one of last two values in dataframe
        if loc<df.shape[0]-2:
            df[col][loc] = np.mean( df[col][loc+1:loc+3] )
        else: 
            df[col][loc] = np.mean( df[col][loc-3:loc-1] )
tnknepp
  • 5,888
  • 6
  • 43
  • 57
  • Thanks @tnknepp yeah i kind of got your logic to work in my dataset. I am having a couple of problems though. One is I only wanted to get the mean of the next rows that relate to the same group. i.e. i need to apply a groupby to your example in my case it would be by 'Country' – IcemanBerlin Jan 03 '14 at 11:35
  • Secondly, I would need to exclude the next row from the mean calc if the next row also happened to be an outlier. – IcemanBerlin Jan 03 '14 at 11:37