Good day. I am trying to remove outliers from one of the columns in a table:

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats
import numpy as np

df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv")
df["ConvertedComp"].plot(kind="box", figsize=(10,10))
z_scores = stats.zscore(df["ConvertedComp"])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
new_df = df[filtered_entries]

I get the following error:

---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-133-7811da442811> in <module>
      4 z_scores
      5 abs_z_scores = np.abs(z_scores)
----> 6 filtered_entries = (abs_z_scores < 3).all(axis=1)
      7 #new_df = df[filtered_entries]

C:\ProgramData\WatsonStudioDesktop\miniconda3\envs\desktop\lib\site-packages\numpy\core\_methods.py in _all(a, axis, dtype, out, keepdims)
     44 
     45 def _all(a, axis=None, dtype=None, out=None, keepdims=False):
---> 46     return umr_all(a, axis, dtype, out, keepdims)
     47 
     48 def _count_reduce_items(arr, axis):

AxisError: axis 1 is out of bounds for array of dimension 1

I would be grateful for any advice; I am almost out of ideas.

maric92

1 Answer

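Your z-score is computed over a single column, so the result is a one-dimensional array. A 1-D array has no axis 1, which is why .all(axis=1) raises AxisError. Drop that call and use the boolean mask directly: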

z_scores = stats.zscore(df["ConvertedComp"])
new_df = df[np.abs(z_scores) < 3]
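To see the difference without downloading the survey file, here is a minimal sketch on synthetic data. The toy frame and its values are assumptions made up for illustration; only the ConvertedComp column name comes from the question.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the survey column: 19 typical values, one extreme one.
toy = pd.DataFrame({"ConvertedComp": [50.0] * 19 + [1000.0]})

z_scores = stats.zscore(toy["ConvertedComp"])
mask = np.abs(z_scores) < 3          # one boolean per row

# The mask is one-dimensional, so reducing along axis 1 is not possible.
mask_ndim = np.ndim(mask)
try:
    np.asarray(mask).all(axis=1)
    axis_error = None
except (ValueError, IndexError) as exc:   # numpy's AxisError derives from both
    axis_error = exc

# Indexing with the 1-D mask directly keeps the rows with |z| < 3.
filtered = toy[mask]
print(mask_ndim, len(toy), len(filtered))
```

One caveat: stats.zscore propagates NaN by default, so if the real column contains missing values every z-score becomes NaN and the mask selects no rows. Dropping the missing values first, or passing nan_policy="omit", is one way around that.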

If you ran zscore over multiple columns, the result would be two-dimensional and your original code would work:

z_scores = stats.zscore(df[["ConvertedComp", 'AnotherColumn']])
new_df = df[(np.abs(z_scores) < 3).all(axis=1)]
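A small sketch of the multi-column case, again on made-up data. AnotherColumn is the placeholder name from the snippet above, not a real column in the survey, and the values are invented.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic two-column frame: each column has one extreme value, in different rows.
toy = pd.DataFrame({
    "ConvertedComp": [50.0] * 19 + [1000.0],   # outlier in the last row
    "AnotherColumn": [500.0] + [10.0] * 19,    # outlier in the first row
})

# zscore standardizes each column separately (axis=0 by default),
# so the result has one row per record and one column per input column.
z_scores = stats.zscore(toy[["ConvertedComp", "AnotherColumn"]])
z_shape = np.asarray(z_scores).shape

# all(axis=1) keeps a row only if every column is within 3 standard deviations.
row_ok = (np.abs(z_scores) < 3).all(axis=1)
filtered = toy[row_ok]
print(z_shape, len(filtered))
```

Note that all(axis=1) drops a row if any one of the selected columns is an outlier; use any(axis=1) instead if a row should survive when at least one column is within range.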
Code Different