
I am trying to plot the distribution of my dataset in order to find outliers using the IQR. However, instead of showing a range such as 0 to 100000, the x-axis runs from 0 to 1, with almost all of the data clustered at 0, even though I have removed all null values. Could someone please explain where I have gone wrong and why the scale of my plot is only 0 to 1? The full code and an image of the plot are below. The dataset has an IQR of 51770, so a scale of 0 to 1 cannot be right, unless it is a scaled-down version of the real values.

The result is also not particularly useful: instead of an outlier list with, say, 10 values, there are too many outliers to count.

import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)

from IPython.display import display

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None

invest_2019 = pd.read_csv("Investment_2019.csv")
# Replace negative investments with NaN using .loc to avoid chained assignment
invest_2019.loc[invest_2019['Investment2019'] < 0, 'Investment2019'] = np.nan
invest_2019.dropna(inplace = True)

invest_2019.isnull().sum()

x = invest_2019['Investment2019']

# Detect Outliers:
sns.boxplot(x=x)  # initial plot; pass the Series by keyword so it maps to the x-axis
plt.show()

Q1 = x.quantile(0.25)
Q3 = x.quantile(0.75)
IQR = Q3 - Q1
print("IQR: ", IQR, "\n")

b = Q1 - (1.5*IQR)
t = Q3 + (1.5*IQR)
r = t-b
print("bottom shadow:", b)
print("top shadow", t)
print("range: ", r, "\n")

# Reuse the fences computed above: values below b or above t are outliers
outlie = x[(x < b) | (x > t)]

outlie
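
To make the outlier rule above easier to reproduce without my CSV, here is a minimal sketch of the same 1.5 * IQR fences on synthetic data. The helper name `iqr_outliers`, the parameter `k`, and the log-normal sample are assumptions for illustration only; they are not part of my real dataset or code. The sketch only shows that on heavily right-skewed values this rule can flag a large number of points, which may be related to the long outlier list I am seeing.

```python
import numpy as np
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr  # "bottom shadow" in the code above
    upper = q3 + k * iqr  # "top shadow" in the code above
    return values[(values < lower) | (values > upper)]


# Synthetic, right-skewed sample (an assumption; NOT Investment_2019.csv)
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=10, sigma=2, size=1000),
                   name="Investment2019")

flagged = iqr_outliers(sample)
print(len(flagged), "of", len(sample), "values fall outside the fences")
```

With a skewed sample like this, the lower fence is negative, so every flagged value sits above the upper fence.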

[Image: boxplot]
