I'd like a function in Matplotlib similar to the Matlab 'scatterhist' function, which takes continuous values for the 'x' and 'y' axes plus a categorical variable as input, and produces a scatter plot with marginal KDE plots as output, with two or more categories shown in different colours.

[Image: Matlab / Mathworks 'scatterhist' example]

I've found examples of scatter plots with marginal histograms in Matplotlib, marginal histograms in Seaborn jointplot, overlapping histograms in Matplotlib, and marginal KDE plots in Matplotlib; but I haven't found any examples which combine scatter plots with marginal KDE plots and are colour coded to indicate different categories.

If possible, I'd like a solution which uses 'vanilla' Matplotlib without Seaborn, as this will avoid dependencies and allow complete control and customisation of the plot appearance using standard Matplotlib commands.

I was going to try to write something based on the above examples; but before doing so wanted to check whether a similar function was already available, and if not then would be grateful for any guidance on the best approach to use.

Dave
  • Disentangle the problem. Can you draw a scatter plot with different colors? Can you draw a KDE plot? Can you position the axes in the desired way? Combined, those will give you the final graph. – ImportanceOfBeingErnest Jul 30 '19 at 20:03
  • Thanks for your comment. If no similar function already exists, I was planning a step-wise approach to writing the code as you suggest, using a combination of techniques from each of the different examples linked above. My input data will probably be in the form of a .csv file, with two columns of continuous variables, and one column of categorical variable. I'm therefore wondering whether it would be better to use the Pandas library to import and assign the variables rather than trying to do this in numpy ? – Dave Aug 01 '19 at 18:38
  • For data manipulation pandas is a great tool; but it's not necessary. If you decide to use pandas, you can always come back to numpy arrays via `df["column"].values`. So essentially you cannot go wrong in either case. – ImportanceOfBeingErnest Aug 01 '19 at 18:51
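To illustrate the last comment, here is a minimal sketch of moving from pandas back to NumPy arrays. The column names and values are made up for illustration; `.to_numpy()` is the newer equivalent of `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two continuous columns and one categorical column.
df = pd.DataFrame({
    "x": [4.1, 5.0, 6.2, 5.5],
    "y": [6.0, 5.2, 4.1, 4.8],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# .values (or the newer .to_numpy()) returns a plain NumPy array.
x = df["x"].values
y = df["y"].to_numpy()

# A boolean mask built from the categorical column selects one group.
mask = (df["species"] == "virginica").to_numpy()
x_virginica = x[mask]
y_virginica = y[mask]
```

Either route ends with ordinary arrays that can be passed straight to Matplotlib or SciPy.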

2 Answers

@ImportanceOfBeingErnest: Many thanks for your help. Here's my first attempt at a solution. It's a bit hacky, but it achieves my objectives and is fully customisable using standard Matplotlib commands. I'm posting the code here with annotations in case anyone else wishes to use it or develop it further. If there are any improvements or neater ways of writing the code, I'm always keen to learn and would be grateful for guidance.

[Image: Scatter_MarginalKDE]

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from scipy import stats

label = ['Setosa','Versicolor','Virginica'] # List of labels for categories
cl = ['b','r','y'] # List of colours for categories
categories = len(label)
sample_size = 20 # Number of samples in each category

# Create numpy arrays for dummy x and y data:
x = np.zeros(shape=(categories, sample_size))
y = np.zeros(shape=(categories, sample_size))

# Generate random data for each categorical variable:
for n in range(categories):
    x[n, :] = np.random.randn(sample_size) + 4 + n
    y[n, :] = np.random.randn(sample_size) + 6 - n

# Set up 4 subplots as axis objects using GridSpec:
gs = gridspec.GridSpec(2, 2, width_ratios=[1,3], height_ratios=[3,1])
# Add space between scatter plot and KDE plots to accommodate axis labels:
gs.update(hspace=0.3, wspace=0.3)

# Set background canvas colour to White instead of grey default
fig = plt.figure()
fig.patch.set_facecolor('white')

ax = plt.subplot(gs[0,1]) # Instantiate scatter plot area and axis range
ax.set_xlim(x.min(), x.max())
ax.set_ylim(y.min(), y.max())
ax.set_xlabel('x')
ax.set_ylabel('y')

axl = plt.subplot(gs[0,0], sharey=ax) # Instantiate left KDE plot area
axl.get_xaxis().set_visible(False) # Hide tick marks and spines
axl.get_yaxis().set_visible(False)
axl.spines["right"].set_visible(False)
axl.spines["top"].set_visible(False)
axl.spines["bottom"].set_visible(False)

axb = plt.subplot(gs[1,1], sharex=ax) # Instantiate bottom KDE plot area
axb.get_xaxis().set_visible(False) # Hide tick marks and spines
axb.get_yaxis().set_visible(False)
axb.spines["right"].set_visible(False)
axb.spines["top"].set_visible(False)
axb.spines["left"].set_visible(False)

axc = plt.subplot(gs[1,0]) # Instantiate legend plot area
axc.axis('off') # Hide tick marks and spines

# Plot data for each categorical variable as scatter and marginal KDE plots:
for n in range (0, categories):
    ax.scatter(x[n],y[n], color='none', label=label[n], s=100, edgecolor= cl[n])

    kde = stats.gaussian_kde(x[n,:])
    xx = np.linspace(x.min(), x.max(), 1000)
    axb.plot(xx, kde(xx), color=cl[n])

    kde = stats.gaussian_kde(y[n,:])
    yy = np.linspace(y.min(), y.max(), 1000)
    axl.plot(kde(yy), yy, color=cl[n])

# Copy legend object from scatter plot to lower left subplot and display:
# NB 'scatterpoints = 1' customises legend box to show only 1 handle (icon) per label 
handles, labels = ax.get_legend_handles_labels()
axc.legend(handles, labels, scatterpoints = 1, loc = 'center', fontsize = 12)

plt.show()

Dave

Version 2, using Pandas to import 'real' data from a csv file, with a different number of entries in each category. (csv file format: row 0 = headers; col 0 = x values, col 1 = y values, col 2 = category labels). Scatterplot axis and legend labels are generated from column headers.
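As a quick check of the expected file layout, the sketch below reads a small made-up table in that format from a string instead of a file. The column names and values are assumptions for illustration only, not the contents of the actual csv file.

```python
import io
import pandas as pd

# Hypothetical file contents in the layout described above:
# a header row, then x value, y value, and category label on each row.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
4.9,3.0,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
5.8,2.7,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

headers = list(df.columns)               # used later for axis and legend labels
counts = df[headers[2]].value_counts()   # categories may differ in size
```

Note that the labels are written without quotes or leading spaces, so that identical categories compare equal after import.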

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from scipy import stats
import pandas as pd

"""
Create scatter plot with marginal KDE plots
from csv file with 3 cols of data
formatted as in the following example (first row of
data are headers):
x_label,y_label,category_label
4,5,virginica
3,6,setosa
4,6,virginica  etc...
"""

df = pd.read_csv('iris_2.csv') # enter filename for csv file to be imported (within current working directory)
cl = ['b','r','y', 'g', 'm', 'k'] # Custom list of colours, one per category - extend if there are more than six categories

headers = list(df.columns) # Extract list of column headers
# Find min and max values for all x (= col [0]) and y (= col [1]) in dataframe:
xmin, xmax = df.iloc[:, 0].min(), df.iloc[:, 0].max()
ymin, ymax = df.iloc[:, 1].min(), df.iloc[:, 1].max()
# Create a list of all unique categories which occur in the right hand column (ie index '2'):
# (.iloc selects by position; the older .ix indexer has been removed from pandas)
category_list = df.iloc[:, 2].unique()

# Set up 4 subplots and aspect ratios as axis objects using GridSpec:
gs = gridspec.GridSpec(2, 2, width_ratios=[1,3], height_ratios=[3,1])
# Add space between scatter plot and KDE plots to accommodate axis labels:
gs.update(hspace=0.3, wspace=0.3)

fig = plt.figure() # Set background canvas colour to White instead of grey default
fig.patch.set_facecolor('white')

ax = plt.subplot(gs[0,1]) # Instantiate scatter plot area and axis range
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.set_xlabel(headers[0], fontsize = 14)
ax.set_ylabel(headers[1], fontsize = 14)
ax.yaxis.labelpad = 10 # adjust space between the y axis and its label if needed

axl = plt.subplot(gs[0,0], sharey=ax) # Instantiate left KDE plot area
axl.get_xaxis().set_visible(False) # Hide tick marks and spines
axl.get_yaxis().set_visible(False)
axl.spines["right"].set_visible(False)
axl.spines["top"].set_visible(False)
axl.spines["bottom"].set_visible(False)

axb = plt.subplot(gs[1,1], sharex=ax) # Instantiate bottom KDE plot area
axb.get_xaxis().set_visible(False) # Hide tick marks and spines
axb.get_yaxis().set_visible(False)
axb.spines["right"].set_visible(False)
axb.spines["top"].set_visible(False)
axb.spines["left"].set_visible(False)

axc = plt.subplot(gs[1,0]) # Instantiate legend plot area
axc.axis('off') # Hide tick marks and spines

# For each category in the list...
for n in range(0, len(category_list)):
    # Create a sub-table containing only entries matching current category:
    st = df.loc[df[headers[2]] == category_list[n]]
    # Select first two columns of sub-table as x and y values to be plotted:
    x = st[headers[0]]
    y = st[headers[1]]

    # Plot data for each categorical variable as scatter and marginal KDE plots:    
    ax.scatter(x,y, color='none', s=100, edgecolor= cl[n], label = category_list[n])

    kde = stats.gaussian_kde(x)
    xx = np.linspace(xmin, xmax, 1000)
    axb.plot(xx, kde(xx), color=cl[n])

    kde = stats.gaussian_kde(y)
    yy = np.linspace(ymin, ymax, 1000)
    axl.plot(kde(yy), yy, color=cl[n])

# Copy legend object from scatter plot to lower left subplot and display:
# NB 'scatterpoints = 1' customises legend box to show only 1 handle (icon) per label 
handles, labels = ax.get_legend_handles_labels()
axc.legend(handles, labels, title = headers[2], scatterpoints = 1, loc = 'center', fontsize = 12)

plt.show()
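
Since the original question asked for a function, the steps above could be wrapped up roughly as follows. This is only a sketch: `scatter_kde` and its parameters are hypothetical names, the marginal axes are switched off entirely for brevity (the code above keeps one spine on each), and it assumes each category has enough distinct points for `gaussian_kde` to work.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats


def scatter_kde(x, y, groups, colours=("b", "r", "y", "g", "m", "k")):
    """Scatter plot with per-category marginal KDE curves (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)

    fig = plt.figure()
    gs = fig.add_gridspec(2, 2, width_ratios=[1, 3], height_ratios=[3, 1],
                          hspace=0.3, wspace=0.3)
    ax = fig.add_subplot(gs[0, 1])               # main scatter plot
    axl = fig.add_subplot(gs[0, 0], sharey=ax)   # left marginal (density of y)
    axb = fig.add_subplot(gs[1, 1], sharex=ax)   # bottom marginal (density of x)
    axc = fig.add_subplot(gs[1, 0])              # holds the legend only
    for a in (axl, axb, axc):
        a.axis("off")

    # Evaluate every KDE on the same grid so the curves are comparable.
    xx = np.linspace(x.min(), x.max(), 200)
    yy = np.linspace(y.min(), y.max(), 200)
    for n, g in enumerate(np.unique(groups)):
        sel = groups == g
        c = colours[n % len(colours)]            # cycle if colours run out
        ax.scatter(x[sel], y[sel], facecolors="none", edgecolors=c,
                   s=100, label=str(g))
        axb.plot(xx, stats.gaussian_kde(x[sel])(xx), color=c)
        axl.plot(stats.gaussian_kde(y[sel])(yy), yy, color=c)

    handles, labels = ax.get_legend_handles_labels()
    axc.legend(handles, labels, loc="center", scatterpoints=1)
    return fig, ax
```

With a dataframe laid out as in Version 2, this might be called as `fig, ax = scatter_kde(df.iloc[:, 0], df.iloc[:, 1], df.iloc[:, 2])`, followed by the usual `ax.set_xlabel(...)` and `plt.show()`. Returning `fig` and `ax` keeps the plot customisable with standard Matplotlib commands.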
DaveW