Dealing with outliers in Pandas

Question

Good day. I am trying to remove outliers from one column of a table with the following code:

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats
import numpy as np

df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv")
df["ConvertedComp"].plot(kind="box", figsize=(10,10))
z_scores = stats.zscore(df["ConvertedComp"])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
new_df = df[filtered_entries]

It fails with the following error:

---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-133-7811da442811> in <module>
      4 z_scores
      5 abs_z_scores = np.abs(z_scores)
----> 6 filtered_entries = (abs_z_scores < 3).all(axis=1)
      7 #new_df = df[filtered_entries]

C:\ProgramData\WatsonStudioDesktop\miniconda3\envs\desktop\lib\site-packages\numpy\core\_methods.py in _all(a, axis, dtype, out, keepdims)
     44 
     45 def _all(a, axis=None, dtype=None, out=None, keepdims=False):
---> 46     return umr_all(a, axis, dtype, out, keepdims)
     47 
     48 def _count_reduce_items(arr, axis):

AxisError: axis 1 is out of bounds for array of dimension 1

I would be grateful for any advice; I have almost run out of ideas.

An array of dimension 1 (e.g. a simple list of numbers) has only one axis: axis 0. — couka, Jan 02 '21 at 22:47

score 1 · Accepted Answer · answered Jan 02 '21 at 23:08

Your z-score is computed over only one column, so the result is a one-dimensional array. It has no axis 1, so drop .all(axis=1) and use the boolean mask directly:

z_scores = stats.zscore(df["ConvertedComp"])
new_df = df[np.abs(z_scores) < 3]
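
One caveat, stated as an assumption about the data rather than something shown in the question: if ConvertedComp contains missing values, stats.zscore with its default nan_policy="propagate" returns NaN for every row, and the mask above would then be all False. A minimal sketch of the same one-dimensional filter, with plain pandas arithmetic swapped in for stats.zscore and a small synthetic column standing in for the survey data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey column (an assumption, not the real data):
# twenty typical values, one extreme value and one missing value.
df = pd.DataFrame(
    {"ConvertedComp": [50_000.0] * 10 + [60_000.0] * 10 + [5_000_000.0, np.nan]}
)

col = df["ConvertedComp"]

# Series.mean() and Series.std() skip NaN by default.
# ddof=0 matches the default used by scipy.stats.zscore.
z_scores = (col - col.mean()) / col.std(ddof=0)

# z_scores is one-dimensional, so the mask is used directly (no .all(axis=1)).
# NaN < 3 evaluates to False, so rows with a missing value are dropped too.
mask = z_scores.abs() < 3
new_df = df[mask]

print(z_scores.ndim, len(df), len(new_df))
```

Whether rows with a missing ConvertedComp should be dropped or kept depends on the analysis; to keep them, combine the mask with col.isna() using the | operator.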

If you ran zscore over multiple columns instead, the result would be two-dimensional and your original code would have worked:

z_scores = stats.zscore(df[["ConvertedComp", 'AnotherColumn']])
new_df = df[(np.abs(z_scores) < 3).all(axis=1)]
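
To make the role of .all(axis=1) concrete, here is a small sketch on synthetic data. "AnotherColumn" is only the placeholder name used above, not a real survey column, and pandas arithmetic is used in place of stats.zscore so the shapes are easy to inspect:

```python
import pandas as pd

# Synthetic data (an assumption for illustration): one extreme ConvertedComp value in the last row.
df = pd.DataFrame({
    "ConvertedComp": [50_000.0] * 10 + [60_000.0] * 10 + [5_000_000.0],
    "AnotherColumn": [1.0, 2.0, 3.0] * 7,
})

cols = df[["ConvertedComp", "AnotherColumn"]]

# Column-wise z-scores (population std, ddof=0): one output column per input column.
z_scores = (cols - cols.mean()) / cols.std(ddof=0)

# Two-dimensional boolean frame: one True/False per cell.
within = z_scores.abs() < 3

# Reduce along axis 1 (across columns): True only if every column of the row is within 3 std.
keep = within.all(axis=1)
new_df = df[keep]

print(within.shape, keep.shape, len(new_df))
```

Here axis 1 exists because two columns were selected; with a single Series there is only axis 0, which is what the AxisError reports.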
