4

Given a pandas dataframe, I want to exclude rows corresponding to outliers (Z-value = 3) based on one of the columns.

The dataframe looks like this:

df.dtypes
_id                   object
_index                object
_score                object
_source.address       object
_source.district      object
_source.price        float64
_source.roomCount    float64
_source.size         float64
_type                 object
sort                  object
priceSquareMeter     float64
dtype: object

For the line:

dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]

The following exception is raised:

-------------------------------------------------------------------------    
TypeError                                 Traceback (most recent call last)
<ipython-input-68-02fb15620e33> in <module>()
----> 1 dff=df[(np.abs(stats.zscore(df)) < 3).all(axis='_source.price')]

/opt/anaconda3/lib/python3.6/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof)
   2239     """
   2240     a = np.asanyarray(a)
-> 2241     mns = a.mean(axis=axis)
   2242     sstd = a.std(axis=axis, ddof=ddof)
   2243     if axis and mns.ndim < a.ndim:

/opt/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
     68             is_float16_result = True
     69 
---> 70     ret = umr_sum(arr, axis, dtype, out, keepdims)
     71     if isinstance(ret, mu.ndarray):
     72         ret = um.true_divide(

TypeError: unsupported operand type(s) for +: 'NoneType' and 'NoneType'

And the return value of

np.isreal(df['_source.price']).all()

is

True

Why do I get the above exception, and how can I exclude the outliers?

Frank
  • 66,179
  • 8
  • 96
  • 180
bayerb
  • 649
  • 2
  • 9
  • 28

3 Answers3

13

If one wants to use the Interquartile Range of a given dataset (i.e. IQR, as shown by a Wikipedia image below) (Ref):

def Remove_Outlier_Indices(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    trueList = ~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR)))
    return trueList

Based on the above eliminator function, the subset of outliers according to the dataset' statistical content can be obtained:

# Arbitrary Dataset for the Example
df = pd.DataFrame({'Data':np.random.normal(size=200)})

# Index List of Non-Outliers
nonOutlierList = Remove_Outlier_Indices(df)

# Non-Outlier Subset of the Given Dataset
dfSubset = df[nonOutlierList]

interquartile range

Herpes Free Engineer
  • 2,425
  • 2
  • 27
  • 34
  • Exactly what I was looking for. Thumbs up. Thanks. – Chocksmith May 28 '21 at 14:57
  • 1
    For those who want to apply this to several quantitative columns (Data1, Data2, etc) at the same time, add .all(1): `dfSubset = df[nonOutlierList.all(1)]` – vpvinc Oct 15 '21 at 13:58
4

Use this boolean whenever you have this sort of issue:

df=pd.DataFrame({'Data':np.random.normal(size=200)})  #example 
df[np.abs(df.Data-df.Data.mean())<=(3*df.Data.std())] #keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] #or the other way around
liam
  • 1,918
  • 3
  • 22
  • 28
  • `AttributeError: 'DataFrame' object has no attribute 'Data'` – Nairum Apr 19 '21 at 14:19
  • 1
    @Nairum `Data` is the column of `df` being accessed. It needs to be changed to fit the actual column of your specific DataFrame you want to detect outliers in. – B-Schmidt Jun 23 '21 at 19:38
2

I believe you could create a boolean filter with the outliers and then select the oposite of it.

outliers = stats.zscore(df['_source.price']).apply(lambda x: np.abs(x) == 3)
df_without_outliers = df[~outliers]