
Current plot and anticipated plot

I'm new to Python. I'm trying to get a subset of the housing index dataset from https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb

I have imported the dataset as 'housing'. I am trying to plot just the outliers above the 0.95 quantile on top of the plot that shows all the values of median_house_value.

import matplotlib.pyplot as plt

housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)

This plots all the rows (i). I am trying to select the median_income values that correspond to the subset of median_house_value above the 0.95 quantile, and plot them over the top in orange (j).

Below is my best attempt so far, which does not give the correct values:

plt.plot(housing.groupby('median_house_value').quantile(q=quant)["median_income"], housing.groupby('median_house_value').quantile(q=quant).index.get_level_values('median_house_value'),"or")

I can get the median_house_value rows above the quantile with:

quantile = int(round(housing["median_house_value"].quantile(q=0.95)))
housing.median_house_value > quantile

I want to end up with two pandas Series: one for the x axis holding the median_income values, and a second holding the corresponding median_house_value values that fall above the quantile.

Thanks in advance.

Happy Machine
  • Is this a data question or plotting question? Please ask one question per post. Also, please include a [data sample](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). If for plotting and your code reproduces the undesired plot, please sketch out (either with software or hand), your desired plot as it is hard to tell with words what you need: *plot them over the top in orange*. – Parfait Dec 26 '18 at 18:37
  • It's a data question; I'm including the plot code for clarity. – Happy Machine Dec 26 '18 at 19:56

1 Answer

IIUC - Simply filter your main dataset since you have a boolean index: housing["median_house_value"] > quantile.

import matplotlib.pyplot as plt

# REQUIRED THRESHOLD
quantile = int(round(housing["median_house_value"].quantile(q=0.95)))
# FILTER BY BOOLEAN
upper_housing = housing[housing["median_house_value"] > quantile]

# PLOTTING (keep the returned Axes and pass it as ax so both scatters share one plot)
ax = housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1, c='blue')

upper_housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1, c='red', ax=ax)

plt.show()
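
If two separate pandas Series are needed rather than a filtered DataFrame, the same boolean mask can be used with .loc. The following is only a minimal sketch: the housing DataFrame here is a synthetic stand-in built from random numbers (an assumption, not the real dataset), and the names threshold, mask, x_upper and y_upper are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumed values, for illustration only)
rng = np.random.default_rng(0)
housing = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=1000),
    "median_house_value": rng.uniform(15000, 500000, size=1000),
})

# Threshold at the 0.95 quantile of median_house_value
threshold = housing["median_house_value"].quantile(0.95)

# Boolean mask: True for rows whose house value is above the threshold
mask = housing["median_house_value"] > threshold

# Two Series that share the same index, so they line up row for row
x_upper = housing.loc[mask, "median_income"]
y_upper = housing.loc[mask, "median_house_value"]
```

Because both Series come from the same mask, they can be passed straight to a plotting call such as plt.plot(x_upper, y_upper, "o", color="orange") on the same axes as the full scatter.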
Parfait