I have a dataset with 17 features and 14k observations.

I would like to plot the price distribution to get a better understanding of it. The price feature has a float64 data type.

Plotting the price distribution gives me the following:

[image: histogram of the price distribution]

Why does this plot look like this? Is something wrong with my data? What's the proper way to solve this?

code:

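import matplotlib.pyplot as plt

# `data` is the DataFrame that holds the dataset
fig, ax = plt.subplots(1, 1, figsize=(9, 5))
data['sale_price'].hist(bins=50, ax=ax)  # 50 equal-width bins over the full price range
ax.set_xlabel('Price')
ax.set_title('Distribution of prices')
ax.set_ylabel('Number of houses')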
desertnaut
user3641381

1 Answer

It seems your distribution is heavily long-tailed: you have prices up to 3*1e7, while the majority of your data are much smaller, on the order of 1e6. With bins=50, the equal-width bins have to span that whole range, so the first bin ends up containing almost all of the data. Possible treatments:

  • Use logarithmic bins (see this post)
  • Choose the bins according to the 0-75% quantile range

Note, however, that the second solution creates an ugly accumulation of counts at the right tail of the histogram, which may not be desired. Still, it depends on the data. I'd use a logarithmic histogram for house prices; I think it makes more sense in terms of visualization.
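A minimal sketch of both treatments follows. The prices here are synthetic log-normal values standing in for `data['sale_price']`; the data, the seed, and the variable names are assumptions for illustration, not the asker's dataset. The quantile option is implemented with `np.clip`, which is one way to get the pile-up at the right edge described above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for data['sale_price'] (assumption): positive and long-tailed.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=np.log(1e6), sigma=0.8, size=14_000)

# Option 1: logarithmically spaced bin edges, shown on a log-scaled x axis.
log_bins = np.geomspace(prices.min(), prices.max(), 51)  # 51 edges -> 50 bins

# Option 2: equal-width bins up to the 75% quantile.
# Clipping moves every larger price onto q75, so they pile up in the last bin.
q75 = np.quantile(prices, 0.75)
clipped = np.clip(prices, None, q75)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.hist(prices, bins=log_bins)
ax1.set_xscale('log')
ax1.set_xlabel('Price (log scale)')
ax1.set_ylabel('Number of houses')
ax1.set_title('Logarithmic bins')

ax2.hist(clipped, bins=50)
ax2.set_xlabel('Price (clipped at the 75% quantile)')
ax2.set_ylabel('Number of houses')
ax2.set_title('Bins over the 0-75% quantile range')

fig.tight_layout()
# plt.show()  # uncomment when running interactively
```

Logarithmic bins need strictly positive values, so zero or negative prices would have to be filtered out first. To drop the expensive houses instead of piling them into the last bin, passing `range=(prices.min(), q75)` to `hist` is an alternative.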

Alireza
  • Thank you for explaining it. Really helps. So, would you say I should use logarithmic bins, but also collect much more data? – user3641381 May 28 '20 at 19:21
  • I'd say collecting more data wouldn't make a difference, as _this_ histogram shows the true distribution of your population (unless your data collection is biased towards lower-priced houses). It is your visualization method that should be refined; in this case, logarithmic bins do the job. – Alireza May 28 '20 at 19:43
  • Right. Thanks again. Just so I understand.. are you saying that my dataset has much higher values (high prices) than lower prices? Do I understand this correctly? – user3641381 May 28 '20 at 19:51
  • No... what I am saying, and what your data shows, is that the majority of your data are in the range of a million (1e6), which is responsible for that tall bin on the left. But you also have a small number of high prices, and that makes your histogram look like this. Think of it like this: 99% of houses cost about 1 million, but in your data there are also a few huge luxury houses worth 10 million or more, which the bins of your histogram have to cover. Your array is something like `[1, 1, 1, 1, 1.5, 2, 2, 2.5, 3, 3, 10, 25, 100, 1000]`: a lot of values around `1`, but then some expensive houses worth `1000`. – Alireza May 28 '20 at 19:57
  • You can also apply a log scale to the counts (y axis). Try `log=True` in the `hist` method. It might reveal what you are dealing with (if it works, let me know so that I can update my answer). – Alireza May 28 '20 at 20:05
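
The toy array from the comments can be used to check this reasoning. The sketch below counts how many values land in the first of 50 equal-width bins, then draws the same histogram with `log=True` so the sparse bins in the tail become visible. The variable names are illustrative, and the toy values are the ones quoted above, not real prices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy array from the comment above: many small values and a few very large ones.
toy = np.array([1, 1, 1, 1, 1.5, 2, 2, 2.5, 3, 3, 10, 25, 100, 1000])

# 50 equal-width bins over [1, 1000]: each bin is about 20 units wide.
counts, edges = np.histogram(toy, bins=50)
print(f"first bin holds {counts[0]} of {toy.size} values")  # 11 of 14
print(f"non-empty bins: {np.count_nonzero(counts)} of {counts.size}")

# Same bins, but with a log-scaled count axis so bins holding
# a single value are not dwarfed by the tall first bin.
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(toy, bins=50, log=True)
ax.set_xlabel('Price')
ax.set_ylabel('Number of houses (log scale)')
ax.set_title('Distribution of prices, log-scaled counts')
# plt.show()  # uncomment when running interactively
```

A log-scaled y axis keeps the equal-width bins, so the first bin still absorbs most of the data; it only makes the tail readable. Logarithmic bins change how the data are grouped, which is why the two approaches give different pictures.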