I have a DataFrame containing information about signal strength in different locations. I'm trying to clean outliers out of that data, but because the locations differ I can't use mean or quantile on the whole dataset.
I have checked a number of places, including "Idiomatic clip on quantile for DataFrame" and "quantile normalization on pandas dataframe", but none of them provides a good idiomatic solution, especially for newbies to Pandas...
I have arrived at something like the code below, yet it seems terrible, especially in terms of performance and complexity...
import pandas as pd

df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})
def remove_outliers(data, width=0.8):
    lower = .5 - width/2
    upper = .5 + width/2
    data = data.copy(deep=True)
    restructured_data = data.groupby(['x', 'y', 'signal'])
    reliable_measure = restructured_data.quantile([lower, upper])
    print(reliable_measure)
    # TODO: Pandas way
    data['is_reliable_measure'] = False
    for ind in data.index:
        x = data['x'][ind]
        y = data['y'][ind]
        signal = data['signal'][ind]
        measure = data['measure'][ind]
        l_group = (x, y, signal, lower)  # key of the lower-quantile row
        u_group = (x, y, signal, upper)  # key of the upper-quantile row
        l = reliable_measure.loc[l_group, 'measure']
        u = reliable_measure.loc[u_group, 'measure']
        is_reliable_measure = (l <= measure <= u)
        data.loc[ind, 'is_reliable_measure'] = is_reliable_measure
    return data  # [data['is_reliable_measure']]

print(remove_outliers(df))
The result is more or less correct, as it allows removing the outliers:
# Quantile 0.5 +- 0.4 (i.e. 0.1 and 0.9)
                measure
x y signal
0 0 a      0.1     0.51
           0.9     0.91
  1 a      0.1     0.80
           0.9     0.80
    b      0.1     0.78
           0.9     0.78
Function output:
     x  y signal  measure  is_reliable_measure
A    0  0      a     0.50                False
A'   0  0      a     1.00                False
A''  0  0      a     0.55                 True
B    0  1      a     0.80                 True
B'   0  1      b     0.78                 True
So we can observe the proper removal of (the filtering step itself is shown right after this list):
- x: 0, y: 0, signal: a, measure: 1
- x: 0, y: 0, signal: a, measure: 0.5
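The actual removal is then just the filter hinted at in the commented-out return (cleaned is my own name for the result):

cleaned = remove_outliers(df)
# Keep only rows flagged reliable, then drop the helper column.
cleaned = cleaned[cleaned['is_reliable_measure']].drop(columns='is_reliable_measure')
print(cleaned)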
My questions are:
- What would be an idiomatic solution, keeping in mind the required grouping?
- Is there a way to decrease the impact of high outliers, like the 1 in the example above, which can end up rejecting the fairly reasonable 0.5 (0.5 and 0.55 are close...)? A rough median-based idea is sketched right after this list.
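Regarding the second question, a rough idea (just a sketch, I'm not sure it is sound): judge each measure by its distance from the group median, which a single high outlier barely moves, instead of by quantiles of the raw values. Both the threshold k and the name mark_reliable_by_median are arbitrary choices of mine.

def mark_reliable_by_median(data, k=0.1):
    groups = data.groupby(['x', 'y', 'signal'])['measure']
    median = groups.transform('median')           # robust per-group centre
    deviation = (data['measure'] - median).abs()  # distance from that centre
    return data.assign(is_reliable_measure=deviation <= k)

With k=0.1 on the sample data, the median of group (0, 0, a) is 0.55, so 0.5 and 0.55 survive while 1 is rejected.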
P.S.
I believe the code could be something like:
df[df.isin(
    df.groupby(['x', 'y', 'signal']).quantile([lower, upper])
)]
but I'm still not sure how to achieve that.
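To make that P.S. concrete, here is my current guess at an idiomatic version, using groupby(...).transform so the bounds line up with the original rows (a minimal sketch; remove_outliers_transform is my own name and I have not fully verified it against the loop version):

def remove_outliers_transform(data, width=0.8):
    lower = .5 - width/2
    upper = .5 + width/2
    groups = data.groupby(['x', 'y', 'signal'])['measure']
    l = groups.transform(lambda s: s.quantile(lower))  # per-row lower bound
    u = groups.transform(lambda s: s.quantile(upper))  # per-row upper bound
    return data.assign(is_reliable_measure=data['measure'].between(l, u))

As far as I can tell, this flags the same rows as the loop version on the sample df.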
Following @juanpa.arrivillaga's suggestion, we now have:
df = pd.DataFrame({'x': {'A': 0, "A'": 0, "A''": 0, 'B': 0, "B'": 0},
                   'y': {'A': 0, "A'": 0, "A''": 0, 'B': 1, "B'": 1},
                   'signal': {'A': 'a', "A'": 'a', "A''": 'a', 'B': 'a', "B'": 'b'},
                   'measure': {'A': .5, "A'": 1, "A''": 0.55, 'B': .8, "B'": 0.78}})

def remove_outliers(data, width=0.8):
    lower = .5 - width/2
    upper = .5 + width/2
    data = data.copy(deep=True)
    restructured_data = data.groupby(['x', 'y', 'signal'])
    reliable_measure = restructured_data.quantile([lower, upper])
    print(reliable_measure)
    # TODO: Pandas way
    data['is_reliable_measure'] = False
    # itertuples yields (index, x, y, signal, measure, is_reliable_measure)
    for ind, x, y, signal, measure, _ in data.itertuples():
        l_group = (x, y, signal, lower)
        u_group = (x, y, signal, upper)
        l = reliable_measure.loc[l_group, 'measure']
        u = reliable_measure.loc[u_group, 'measure']
        is_reliable_measure = (l <= measure <= u)
        data.loc[ind, 'is_reliable_measure'] = is_reliable_measure
    return data  # [data['is_reliable_measure']]

print(remove_outliers(df))
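If the itertuples loop is still a bottleneck, another direction I have been toying with is joining the quantile table back onto the rows instead of looking it up row by row (again only a sketch; remove_outliers_merge is a name I made up):

def remove_outliers_merge(data, width=0.8):
    lower = .5 - width/2
    upper = .5 + width/2
    q = data.groupby(['x', 'y', 'signal'])['measure'].quantile([lower, upper])
    bounds = q.unstack()   # one row per group, one column per quantile level
    bounds.columns = ['l', 'u']
    merged = data.join(bounds, on=['x', 'y', 'signal'])  # attach bounds to every row
    mask = merged['measure'].between(merged['l'], merged['u'])
    return data.assign(is_reliable_measure=mask)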