
I have a DataFrame containing information about signal strength at different locations. I'm trying to clean outliers from that data, but because the locations differ I can't apply mean or quantile to the whole dataset at once.

I have checked a number of places, including Idiomatic clip on quantile for DataFrame and quantile normalization on pandas dataframe, but none provides a good, idiomatic solution, especially for newcomers to pandas...

I have arrived at something like the code below, yet it seems terrible, especially in terms of performance and complexity...

import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def remove_outliers(data, width=0.8):
    lower = .5 - width/2
    upper = .5 + width/2
    data = data.copy(deep=True)
    restructured_data = data.groupby(['x', 'y', 'signal'])
    reliable_measure = restructured_data.quantile([lower, upper])

    print(reliable_measure)

    #TODO: Pandas way
    data['is_reliable_measure'] = False
    for ind in data.index:
        x = data['x'][ind]
        y = data['y'][ind]
        signal = data['signal'][ind]
        measure = data['measure'][ind]

        l_group = (x, y, signal, lower)
        u_group = (x, y, signal, upper)

        l = reliable_measure.loc[l_group, 'measure']
        u = reliable_measure.loc[u_group, 'measure']
        is_reliable_measure = (l <= measure <= u)
        data.loc[ind, 'is_reliable_measure'] = is_reliable_measure

    return data # [data['is_reliable_measure']]

print(remove_outliers(df))

The result is more or less correct, as it allows removing the outliers:

# Quantile +- 0.4
                measure
x y signal             
0 0 a      0.1     0.51
           0.9     0.91
  1 a      0.1     0.80
           0.9     0.80
    b      0.1     0.78
           0.9     0.78

Function output:

     x  y signal  measure  is_reliable_measure
A    0  0      a     0.50                False
A'   0  0      a     1.00                False
A''  0  0      a     0.55                 True
B    0  1      a     0.80                 True
B'   0  1      b     0.78                 True

So we can observe the proper removal of:

  • x:0, y:0, signal:a, measure: 1
  • x:0, y:0, signal:a, measure: 0.5

My questions are:

  1. What would be an idiomatic solution, keeping in mind the required grouping?
  2. Is there a way to decrease the impact of extreme outliers, like the 1 in the example above, which could otherwise lead to rejecting the fairly reasonable 0.5 (0.5 and 0.55 are close...)?

P.S.

I believe the code could be something like:

df[ df.isin(
        df.groupby(['x', 'y', 'signal']).quantile([lower, upper])
)]

but I'm still not sure how to achieve that.
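For what it's worth, here is one possible sketch of that idea using `groupby().transform()` (this is only my guess at an idiomatic shape, not an accepted answer): `transform` broadcasts each group's quantile back onto the group's rows, so the filter becomes a single vectorized comparison with no explicit loop.

```python
import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def remove_outliers(data, width=0.8):
    lower = .5 - width / 2
    upper = .5 + width / 2
    grouped = data.groupby(['x', 'y', 'signal'])['measure']
    # transform returns a Series aligned to data's index, holding each
    # row's own group quantile, so no manual index lookups are needed
    lo = grouped.transform('quantile', lower)
    hi = grouped.transform('quantile', upper)
    return data.assign(is_reliable_measure=data['measure'].between(lo, hi))

print(remove_outliers(df))
```

On the example data this reproduces the output from the question: rows A and A' are marked `False`, the rest `True`.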


Following @juanpa.arrivillaga's suggestion, we now have:

import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def remove_outliers(data, width=0.8):
    lower = .5 - width/2
    upper = .5 + width/2
    data = data.copy(deep=True)
    restructured_data = data.groupby(['x', 'y', 'signal'])
    reliable_measure = restructured_data.quantile([lower, upper])

    print(reliable_measure)

    #TODO: Pandas way
    data['is_reliable_measure'] = False
    for ind, x, y, signal, measure, _ in data.itertuples():
        l_group = (x, y, signal, lower)
        u_group = (x, y, signal, upper)

        l = reliable_measure.loc[l_group, 'measure']
        u = reliable_measure.loc[u_group, 'measure']
        is_reliable_measure = (l <= measure <= u)
        data.loc[ind, 'is_reliable_measure'] = is_reliable_measure

    return data # [data['is_reliable_measure']]

print(remove_outliers(df))
  • `for ind in data.index` use `itertuples` if you are going to do that. – juanpa.arrivillaga Jun 21 '19 at 22:35
  • I was thinking of using `itertuples`, but I had no idea how to perform: `data.loc[ind, 'is_reliable_measure'] = is_reliable_measure` My only concept was to `enumerate` over itertuples, which seems even more messy... – Aleksander Budziński Jun 22 '19 at 11:30
  • @juanpa.arrivillaga regarding iteration I have found https://www.geeksforgeeks.org/different-ways-to-iterate-over-rows-in-pandas-dataframe/ still seems not getting closer on how to better filter outliers :( – Aleksander Budziński Jun 22 '19 at 15:34
  • the tuple includes the index. – juanpa.arrivillaga Jun 22 '19 at 17:44
  • I have found [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) and I have feeling that it may drive towards much cleaner solution. – Aleksander Budziński Jun 23 '19 at 22:45
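Regarding the RobustScaler idea from the last comment, a per-group equivalent can be sketched in plain pandas (scikit-learn's scaler doesn't group by itself): center each group on its median and scale by its IQR, then flag rows whose robust score exceeds a threshold. The threshold of 1.5 below is an assumption for illustration, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def robust_score(data, threshold=1.5):  # threshold is an assumed cut-off
    grouped = data.groupby(['x', 'y', 'signal'])['measure']
    median = grouped.transform('median')
    iqr = grouped.transform(lambda s: s.quantile(.75) - s.quantile(.25))
    # single-measure groups have measure == median and iqr == 0,
    # giving 0/0 = NaN; fillna(0) lets those rows pass
    score = (data['measure'] - median).div(iqr).fillna(0).abs()
    return data.assign(is_reliable_measure=score <= threshold)

print(robust_score(df))
```

On the example data only the measure of 1 is flagged as unreliable, while 0.5 survives, which is the behaviour question 2 asks about: a median/IQR score is far less sensitive to a single extreme value than a quantile window.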
