I'm trying to build a scatterplot of a large amount of data from multiple classes in Python/matplotlib. Unfortunately, it appears that I have to choose between plotting my points in randomized order and having legend labels. Is there a way to have both, preferably without manually coding the labels?

Minimal reproducible example:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
X = np.random.normal(0, 1, [5000, 2])
Y = np.random.normal(0.5, 1, [5000, 2]) 
data = np.concatenate([X,Y])
classes = np.concatenate([np.repeat('X', X.shape[0]),
                          np.repeat('Y', Y.shape[0])])

Plotting with randomized points:

plot_idx = np.random.permutation(data.shape[0])
colors, _ = pd.factorize(classes)  # factorize returns (codes, uniques); keep the integer codes
fig, ax = plt.subplots()
ax.scatter(data[plot_idx, 0], 
           data[plot_idx, 1], 
           c=colors[plot_idx],
           label=classes[plot_idx],
           alpha=0.4)
plt.legend()
plt.show()

scatterplot with randomized points

This gives me the wrong legend.

Plotting with the correct legend:

from matplotlib import cm
unique_classes = np.unique(classes)
colors = cm.Set1(np.linspace(0, 1, len(unique_classes)))
fig, ax = plt.subplots()
# "class" is a reserved word in Python, so the loop variable is named "cls"
for i, cls in enumerate(unique_classes):
    mask = classes == cls
    ax.scatter(data[mask, 0],
               data[mask, 1],
               color=colors[i],
               label=cls,
               alpha=0.4)
plt.legend()
plt.show()

working legend with non-randomized points

But now the points are not randomized and the resulting plot is not representative of the data.

I'm looking for something that would give me a result like I get as follows in R:

library(ggplot2)
X <- matrix(rnorm(10000, 0, 1), ncol=2)
Y <- matrix(rnorm(10000, 0.5, 1), ncol=2)
data <- as.data.frame(rbind(X, Y))
data$classes <- rep(c('X', 'Y'), each=nrow(X))
plot_idx <- sample(nrow(data))

ggplot(data[plot_idx,], aes(x=V1, y=V2, color=classes)) +
  geom_point(alpha=0.4, size=3)

randomized points with working legend in R

Scott Gigante

2 Answers

You need to create the legend manually, but this is not a big problem. Loop over the labels and create one legend entry for each. Here, a Line2D with a marker similar to the scatter markers serves as the handle.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
X = np.random.normal(0, 1, [5000, 2])
Y = np.random.normal(0.5, 1, [5000, 2]) 
data = np.concatenate([X,Y])
classes = np.concatenate([np.repeat('X', X.shape[0]),
                          np.repeat('Y', Y.shape[0])])

plot_idx = np.random.permutation(data.shape[0])
colors,labels = pd.factorize(classes)

fig, ax = plt.subplots()
sc = ax.scatter(data[plot_idx, 0], 
           data[plot_idx, 1], 
           c=colors[plot_idx],
           alpha=0.4)

h = lambda c: plt.Line2D([],[],color=c, ls="",marker="o")
plt.legend(handles=[h(sc.cmap(sc.norm(i))) for i in range(len(labels))],
           labels=list(labels))
plt.show()

Alternatively, you can use a special scatter handler, as shown in the question "Why doesn't the color of the points in a scatter plot match the color of the points in the corresponding legend?", but that seems a bit overkill here.
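As a further option, not part of the answer above: newer matplotlib versions (3.1 and later, as far as I know) provide PathCollection.legend_elements(), which builds the per-color handles for you. The sketch below assumes that method is available; the function name shuffled_scatter_with_legend and the seed parameter are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def shuffled_scatter_with_legend(data, classes, seed=0):
    """Scatter rows of `data` in random draw order, colored by class.

    Hypothetical helper for illustration; assumes matplotlib >= 3.1.
    """
    rng = np.random.default_rng(seed)
    plot_idx = rng.permutation(data.shape[0])
    # codes are integers 0..k-1; labels[j] is the class name for code j
    codes, labels = pd.factorize(classes)

    fig, ax = plt.subplots()
    sc = ax.scatter(data[plot_idx, 0], data[plot_idx, 1],
                    c=codes[plot_idx], alpha=0.4)

    # num=None asks for one handle per unique code, in ascending code order,
    # which matches the order of `labels` returned by pd.factorize.
    handles, _ = sc.legend_elements(num=None)
    ax.legend(handles, list(labels))
    return fig, ax
```

Call it with the data and classes arrays from the question and follow it with plt.show(). The automatically generated labels would be the integer codes, which is why the class names from pd.factorize are passed to ax.legend instead.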

ImportanceOfBeingErnest

It's a bit of a hack, but you can save the axis limits, create the legend entries by drawing one labeled point per class well outside the limits of the plot, and then reset the axis limits as follows:

from matplotlib import cm
# data and classes are the arrays defined in the question
plot_idx = np.random.permutation(data.shape[0])
color_idx, unique_classes = pd.factorize(classes)
colors = cm.Set1(np.linspace(0, 1, len(unique_classes)))
fig, ax = plt.subplots()
ax.scatter(data[plot_idx, 0], 
           data[plot_idx, 1], 
           c=colors[color_idx[plot_idx]],
           alpha=0.4)
xlim = ax.get_xlim()
ylim = ax.get_ylim()
# one off-screen point per class, drawn only to create its legend entry
for i in range(len(unique_classes)):
    ax.scatter(xlim[1]*10, 
               ylim[1]*10, 
               color=colors[i], 
               label=unique_classes[i])
ax.set_xlim(xlim)
ax.set_ylim(ylim)
plt.legend()
plt.show()

randomized points with accurate legend
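A variant of the same idea, sketched here as an assumption rather than as part of the answer above: plotting empty, labeled scatters for the legend entries should leave the autoscaled limits untouched, so there is nothing to save and restore. The function name shuffled_scatter_empty_proxies and the seed parameter are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def shuffled_scatter_empty_proxies(data, classes, seed=0):
    """Shuffled scatter with legend entries taken from empty proxy scatters.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    plot_idx = rng.permutation(data.shape[0])
    color_idx, unique_classes = pd.factorize(classes)
    # one RGBA row per class, sampled from the Set1 colormap
    colors = plt.get_cmap("Set1")(np.linspace(0, 1, len(unique_classes)))

    fig, ax = plt.subplots()
    # main scatter: per-point RGBA colors, drawn in random order
    ax.scatter(data[plot_idx, 0], data[plot_idx, 1],
               c=colors[color_idx[plot_idx]], alpha=0.4)

    # empty scatters draw nothing but still register a labeled legend entry
    for color, name in zip(colors, unique_classes):
        ax.scatter([], [], color=color, label=name)
    ax.legend()
    return fig, ax
```

Call it with the data and classes arrays from the question and follow it with plt.show(). The legend markers are opaque here because the proxy scatters do not set alpha.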

Scott Gigante