import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df=pd.read_csv("C:\\Users\\USER\\Desktop\\ML\\XYZCorp_LendingData.txt", sep='\t',low_memory=False)

df['annual_inc'].head()
'''
0    24000.0
1    30000.0
2    12252.0
3    49200.0
4    80000.0
Name: annual_inc, dtype: float64
'''

sns.set(style="whitegrid")
sns.distplot(df['annual_inc'])

actual plot

This is the desired output, which I am unable to obtain:

desired plot

JohanC

1 Answer


Income usually follows some kind of Pareto distribution (Pareto is also known from the related 80-20 rule). This means a huge number of small values and a sparse tail of larger values, some of which can be extremely high.

Such a distribution doesn't lend itself well to a histogram or a kdeplot. Almost all values end up in the first few bins, with only a few sparse values landing in some of the last bins.
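To see how strongly the values pile up, one can count them per bin without plotting anything. This is a minimal sketch on a synthetic np.random-style Pareto sample; the seed, the sample size and the choice of 10 bins are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.pareto(1.4, size=1000)  # synthetic stand-in for the income column

# 10 equal-width bins spanning min(x) .. max(x)
counts, edges = np.histogram(x, bins=10)
share_first_bin = counts[0] / counts.sum()

print("counts per bin:", counts)
print(f"share of values in the first bin: {share_first_bin:.1%}")
```

A handful of extreme values stretches the bin range so far that nearly the whole sample lands in the first bin.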

See this and this post about why even just trying to fit some parameters can be extremely tricky.

Now, to get a reasonable histogram, one can limit the plot to just the first part. The number of bins needs to be increased (because most of them will be cut off by the changed limits).

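import numpy as np
import seaborn as sns

x = np.random.pareto(1.4, size=1000)
ax = sns.distplot(x, bins=200)  # many bins, since most fall outside the visible range
ax.set_xlim(0, 10)  # show only the dense first part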
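Why the bin count has to go up can be checked numerically: the bins are spread evenly between the smallest and the largest value, so only a fraction of them falls inside the visible range. This is a rough sketch, assuming a seeded synthetic sample and the same x-limit of 10 as above; `visible_bins` is a hypothetical helper written for this illustration, not a seaborn or NumPy function.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.pareto(1.4, size=1000)


def visible_bins(data, bins, x_max):
    """Count equal-width bins that lie completely at or below x_max."""
    edges = np.histogram_bin_edges(data, bins=bins)
    # edges[1:] are the right-hand edges of the bins
    return int(np.sum(edges[1:] <= x_max))


coarse = visible_bins(x, bins=10, x_max=10)
fine = visible_bins(x, bins=200, x_max=10)
print(f"bins inside the visible range: {coarse} of 10, {fine} of 200")
```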

Alternatively, one could cut off all the high values and draw a default distplot. For example, np.percentile(x, 95) calculates the 95th percentile.

from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns

x = np.random.pareto(1.4, size=1000)
cut_off = np.percentile(x, 95)  # get the 95th percentile
sns.distplot(x[x < cut_off])
plt.show()

example distplot

PS: Note that the income on loan applications doesn't follow a strict Pareto distribution. The lowest incomes will be filtered out before reaching the loan application step, and the highest earners don't obtain their loans via standard channels. Nevertheless, to get a nicer plot, limiting the range, increasing the number of bins and experimenting with the bandwidth (kde_kws={"bw": bandwidth}) are still very useful.
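The effect of the bandwidth can be illustrated without seaborn. The following is a hand-rolled Gaussian kernel density estimate in plain NumPy, not seaborn's implementation. `gaussian_kde_on_grid`, the seed and the two bandwidth values are assumptions made for this sketch.

```python
import numpy as np


def gaussian_kde_on_grid(data, grid, bandwidth):
    """Evaluate a Gaussian KDE with a fixed bandwidth on the given grid."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(data) * bandwidth)


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.pareto(1.4, size=1000)
x = x[x < np.percentile(x, 95)]  # drop the extreme tail first

# pad the grid so the smoothed curve is not truncated at the edges
grid = np.linspace(x.min() - 5, x.max() + 5, 2000)
for bandwidth in (0.05, 1.0):
    density = gaussian_kde_on_grid(x, grid, bandwidth)
    print(f"bandwidth={bandwidth}: peak density {density.max():.3f}")
```

A small bandwidth follows the data closely and gives a tall, narrow peak near zero. A large one flattens the peak and smears probability mass below zero, where no incomes exist.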

Cutting off the highest values could also be helpful for a violinplot. For a box plot, it would be better to just limit the x-range (ax.set_xlim(0, ...)). Optionally, you could try ax.set_xscale('log') to show the plot with a logarithmic scale.
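For a histogram on a logarithmic axis, equal-width bins look increasingly narrow towards the right. One common workaround is to use logarithmically spaced bin edges. This is a sketch in NumPy only, assuming a seeded synthetic sample and 20 bins (both arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.pareto(1.4, size=1000)
x = x[x > 0]  # a log scale cannot show zero or negative values

# 21 edges (20 bins) evenly spaced in log10 between the smallest and largest value
log_edges = np.logspace(np.log10(x.min()), np.log10(x.max()), 21)
# pin the outer edges so rounding cannot push the extremes out of range
log_edges[0], log_edges[-1] = x.min(), x.max()

log_counts, _ = np.histogram(x, bins=log_edges)
lin_counts, _ = np.histogram(x, bins=20)

print("largest bin share, linear bins:", lin_counts.max() / len(x))
print("largest bin share, log bins:   ", log_counts.max() / len(x))
```

The same edges could then be passed as bins=log_edges to the plotting call, combined with ax.set_xscale('log').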

JohanC