I am trying to implement a machine-learning algorithm to predict house prices in New York City.

Now, when I use Seaborn to plot the relationship between two columns of my house-prices dataset, 'gross_sqft_thousands' (the gross area of the property in thousands of square feet) and the target column 'sale_price_millions', I get a weird plot like this one:

gross_sqft_thousands vs sale_price_millions

Code used to plot:

sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df);

When I plot the number of commercial units (commercial_units column) versus sale_price_millions, I also get a weird plot like this one:

commercial_units vs sale_price_millions

I get these weird plots even though, in the correlation matrix, sale_price correlates very well with both variables (gross_sqft_thousands and commercial_units).

What am I doing wrong, and what should I do to get a clean plot with fewer points and a clear fit, like this one:

great plot with fewer points

Here is a part of my dataset:

[screenshot of part of the dataset]

ZelelB

1 Answer

Your housing price dataset is much larger than the tips dataset shown in that Seaborn example plot, so scatter plots made with default settings will be massively overcrowded.

The second plot looks "weird" because it plots a (practically) continuous variable, the sale price, against an integer-valued variable, commercial_units, so the points line up in horizontal bands.

The following solutions come to mind:

  1. Downsample the dataset with something like sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df[::10]). The [::10] part selects every 10th row of clean_df. You could also try clean_df.sample(frac=0.1, random_state=12345), which randomly samples 10% of all rows without replacement (with a fixed random seed for reproducibility).

  2. Reduce the alpha (opacity) and/or size of the scatterplot points with sns.regplot(x="sale_price_millions", y="gross_sqft_thousands", data=clean_df, scatter_kws={"alpha": 0.1, "s": 1}).

  3. For plot 2, add a bit of "jitter" (random noise) to the y-axis variable with sns.regplot(..., y_jitter=0.05).
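
Here is a minimal sketch that puts the three suggestions side by side. Since clean_df isn't available here, it uses a synthetic stand-in called demo_df (a made-up DataFrame that only reuses your column names), so the numbers are illustrative and not your actual data:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for clean_df (assumption: invented data, same column names).
rng = np.random.default_rng(12345)
n = 5_000
sale_price = rng.lognormal(mean=0.0, sigma=0.8, size=n)
demo_df = pd.DataFrame({
    "sale_price_millions": sale_price,
    "gross_sqft_thousands": 2.0 * sale_price + rng.normal(0.0, 1.0, size=n),
    "commercial_units": rng.poisson(lam=sale_price),  # integer-valued column
})

# Suggestion 1: random 10% subsample, without replacement, reproducible.
sampled = demo_df.sample(frac=0.1, random_state=12345)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sns.regplot(x="sale_price_millions", y="gross_sqft_thousands",
            data=sampled, ax=axes[0])
axes[0].set_title("1. 10% random sample")

# Suggestion 2: keep all rows, but make the points small and transparent.
sns.regplot(x="sale_price_millions", y="gross_sqft_thousands",
            data=demo_df, scatter_kws={"alpha": 0.1, "s": 1}, ax=axes[1])
axes[1].set_title("2. Low alpha, small markers")

# Suggestion 3: jitter the integer-valued y variable so points don't stack.
# The jitter only affects the drawn points, not the fitted regression line.
sns.regplot(x="sale_price_millions", y="commercial_units",
            data=sampled, y_jitter=0.05, scatter_kws={"alpha": 0.3}, ax=axes[2])
axes[2].set_title("3. y_jitter on integer column")

fig.tight_layout()
# Call plt.show() or fig.savefig("regplot_demo.png") to view the result.
```

To try this on your own data, replace demo_df with clean_df and adjust frac, alpha, s, and y_jitter until the plot is readable.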

For more, check out the Seaborn documentation on regplot: https://seaborn.pydata.org/generated/seaborn.regplot.html

Peter Leimbigler
  • Great to hear! I just added a third suggestion to make the second plot more readable: add `y_jitter=0.05` or some other small value to the arguments of `regplot`. Happy coding! – Peter Leimbigler Jun 02 '19 at 16:02
  • Cool, thanks a lot! Helped! Would you mind looking at this question: https://stackoverflow.com/questions/56417430/how-to-get-legend-next-to-plot-in-seaborn – ZelelB Jun 02 '19 at 17:38