
I have a DataFrame containing information about signal strength at different locations. I'm trying to clean outliers from that data, but because the locations differ I can't apply mean or quantile to the whole dataset at once.

I have checked a number of places, including Idiomatic clip on quantile for DataFrame and quantile normalization on pandas dataframe, but none provides a good, idiomatic solution, especially for newcomers to pandas...

I have arrived at something like the code below, yet it seems terrible, especially in terms of performance and complexity...

import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def remove_outliers(data, width=0.8):
    lower = .5 - width/2
    upper = .5 + width/2
    data = data.copy(deep=True)
    restructured_data = data.groupby(['x', 'y', 'signal'])
    reliable_measure = restructured_data.quantile([lower, upper])

    print(reliable_measure)

    #TODO: Pandas way
    data['is_reliable_measure'] = False
    for ind in data.index:
        x = data['x'][ind]
        y = data['y'][ind]
        signal = data['signal'][ind]
        measure = data['measure'][ind]

        l_group = (x, y, signal, lower)
        u_group = (x, y, signal, upper)

        l = reliable_measure.loc[l_group, 'measure']
        u = reliable_measure.loc[u_group, 'measure']
        is_reliable_measure = (l <= measure <= u)
        data.loc[ind, 'is_reliable_measure'] = is_reliable_measure

    return data # [data['is_reliable_measure']]

print(remove_outliers(df))

The result is more or less correct, as it allows removing the outliers:

# Quantile +- 0.4
                measure
x y signal             
0 0 a      0.1     0.51
           0.9     0.91
  1 a      0.1     0.80
           0.9     0.80
    b      0.1     0.78
           0.9     0.78

Function output:

     x  y signal  measure  is_reliable_measure
A    0  0      a     0.50                False
A'   0  0      a     1.00                False
A''  0  0      a     0.55                 True
B    0  1      a     0.80                 True
B'   0  1      b     0.78                 True

So we can observe the proper removal of:

  • x:0, y:0, signal:a, measure: 1
  • x:0, y:0, signal:a, measure: 0.5

My questions are:

  1. What would be an idiomatic solution, keeping in mind the required grouping?
  2. Is there a way to decrease the impact of extreme outliers, like the 1 in the example above, which could otherwise lead to rejecting the fairly reasonable 0.5 (0.5 and 0.55 are close...)?

P.S.

I believe the code could be something like:

df[ df.isin(
        df.groupby(['x', 'y', 'signal']).quantile([lower, upper])
)]

but I'm still not sure how to achieve that.
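For what it's worth, here is one possible sketch of that idea using `groupby().transform()` (this is only my guess at an idiomatic shape, not an accepted answer): `transform` broadcasts each group's quantile back onto the group's rows, so the filter becomes a single vectorized comparison with no explicit loop.

```python
import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def remove_outliers(data, width=0.8):
    lower = .5 - width / 2
    upper = .5 + width / 2
    grouped = data.groupby(['x', 'y', 'signal'])['measure']
    # transform returns a Series aligned to data's index, holding each
    # row's own group quantile, so no manual index lookups are needed
    lo = grouped.transform('quantile', lower)
    hi = grouped.transform('quantile', upper)
    return data.assign(is_reliable_measure=data['measure'].between(lo, hi))

print(remove_outliers(df))
```

On the example data this reproduces the output from the question: rows A and A' are marked `False`, the rest `True`.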


Following @juanpa.arrivillaga's suggestion, we now have:

import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def remove_outliers(data, width=0.8):
    lower = .5 - width/2
    upper = .5 + width/2
    data = data.copy(deep=True)
    restructured_data = data.groupby(['x', 'y', 'signal'])
    reliable_measure = restructured_data.quantile([lower, upper])

    print(reliable_measure)

    #TODO: Pandas way
    data['is_reliable_measure'] = False
    for ind, x, y, signal, measure, _ in data.itertuples():
        l_group = (x, y, signal, lower)
        u_group = (x, y, signal, upper)

        l = reliable_measure.loc[l_group, 'measure']
        u = reliable_measure.loc[u_group, 'measure']
        is_reliable_measure = (l <= measure <= u)
        data.loc[ind, 'is_reliable_measure'] = is_reliable_measure

    return data # [data['is_reliable_measure']]

print(remove_outliers(df))
  • `for ind in data.index` use `itertuples` if you are going to do that. – juanpa.arrivillaga Jun 21 '19 at 22:35
  • I was thinking of using `itertuples`, but I had no idea how to perform: `data.loc[ind, 'is_reliable_measure'] = is_reliable_measure` My only concept was to `enumerate` over itertuples, which seems even more messy... – Aleksander Budziński Jun 22 '19 at 11:30
  • @juanpa.arrivillaga regarding iteration I have found https://www.geeksforgeeks.org/different-ways-to-iterate-over-rows-in-pandas-dataframe/ still seems not getting closer on how to better filter outliers :( – Aleksander Budziński Jun 22 '19 at 15:34
  • the tuple includes the index. – juanpa.arrivillaga Jun 22 '19 at 17:44
  • I have found [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) and I have feeling that it may drive towards much cleaner solution. – Aleksander Budziński Jun 23 '19 at 22:45
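Regarding the RobustScaler idea from the last comment, a per-group equivalent can be sketched in plain pandas (scikit-learn's scaler doesn't group by itself): center each group on its median and scale by its IQR, then flag rows whose robust score exceeds a threshold. The threshold of 1.5 below is an assumption for illustration, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def robust_score(data, threshold=1.5):  # threshold is an assumed cut-off
    grouped = data.groupby(['x', 'y', 'signal'])['measure']
    median = grouped.transform('median')
    iqr = grouped.transform(lambda s: s.quantile(.75) - s.quantile(.25))
    # single-measure groups have measure == median and iqr == 0,
    # giving 0/0 = NaN; fillna(0) lets those rows pass
    score = (data['measure'] - median).div(iqr).fillna(0).abs()
    return data.assign(is_reliable_measure=score <= threshold)

print(robust_score(df))
```

On the example data only the measure of 1 is flagged as unreliable, while 0.5 survives, which is the behaviour question 2 asks about: a median/IQR score is far less sensitive to a single extreme value than a quantile window.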
