I have two DataFrames: one holds the training data and the other holds the labels. There are 6 features/columns in the training data and 1 column in the labels DataFrame. I want 6 plots in my facet grid, all of them scatter plots: feature 1 vs label, feature 2 vs label, and so on through feature 6 vs label.

Can someone show me how to do this?

For instance, using these sample DataFrames:

In [15]: training
Out[15]:
   feature1  feature2  feature3  feature4  feature5  feature6
0         2         3         4         5         2         5
1         5         4         2         5         6         2

In [16]: labels
Out[16]:
   label
0     34
1      2

This should make 6 separate scatter plots, each with 2 data points.

BigBoy1337

1 Answer

Seaborn has a nice FacetGrid class. You can merge your two DataFrames into one long-format DataFrame, then wrap a seaborn FacetGrid around a normal matplotlib.pyplot.scatter() call:

import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns

# make a test dataframe with 6 feature columns (feature1..feature6) and 10 rows
features = {}
for i in range(1, 7):
    features['feature%s' % i] = [random.random() for j in range(10)]
f = pd.DataFrame(features)
labels = pd.DataFrame({'label': [random.random() for j in range(10)]})

# unstack it so the feature names are now in a single column
unstacked = pd.DataFrame(f.unstack()).reset_index()
unstacked.columns = ['feature', 'feature_index', 'feature_value']
# merge on the original row index to get the label value for each feature value
plot_data = pd.merge(unstacked, labels, left_on='feature_index', right_index=True)
# wrap a seaborn facetgrid: one column of the grid per feature
kws = dict(s=50, linewidth=.5, edgecolor="w")
g = sns.FacetGrid(plot_data, col="feature")
g = g.map(plt.scatter, "feature_value", "label", **kws)

Sam
  • I like your answer, but there may be a small technical mistake in the for loop - for i in range(7) ... then i is used again in "[random.random() for i in range(10)]" ... maybe that should be changed to "j" or something? – Mike Chirico Mar 03 '16 at 14:43
  • I think you'll find if you test the code that it does come up with the desired result of a randomly-generated test dataframe; but I agree the double-use of i could be a little confusing. – Sam Mar 03 '16 at 15:04
  • Ah..I'm guessing you're using Python 3? Yeah, Python version 2 leaks the control variable. Ref: http://stackoverflow.com/a/4199355/904032 ... I was running it on version 2. – Mike Chirico Mar 03 '16 at 17:06
  • true, i was on 3.X. The compatibility issues rear their ugly head again! – Sam Mar 03 '16 at 18:35
  • This is working great, except when the data for each feature is on a different scale: sometimes it's 1-2000 and other times it's 0-1, yet every plot still gets the exact same x scale. Is there a way to free up each x scale so it adjusts to the data range it's showing? – BigBoy1337 Mar 03 '16 at 20:28
  • Well, I think the purpose of using FacetGrid is to put all the variables of one axis on the same scale. So you can swap col="feature" to row="feature", and you'll get a different label axis, but then the features will have a shared axis. If you want both on different axes for every feature, you probably just want 6 different plots and not a facetgrid at all. – Sam Mar 03 '16 at 20:50
  • ah good to know. That must be why the facet grid always wants to work off of a column type division anyways - rather than with multiple features in various ranges as described here – BigBoy1337 Mar 03 '16 at 22:33
  • Using Sam's code, add the parameter `sharex = False` when calling `sns.FacetGrid()`. Each x scale will have a different range. The y axis will still have a shared range. [Source](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.FacetGrid.html#seaborn.FacetGrid) – blue_chip Mar 11 '16 at 20:23