There are two sets of "labels" in that plot:
one is the "true" labels of the Fisher Iris dataset (the species
variable which contains the class of each instance: setoas
, versicolor
, or virginica
). Normally you wouldn't have those in a real dataset (after all the goal of clustering is to discover those groups within the data, which you don't know beforehand). I just used them here to get an idea of how well the EM clustering performed against the actual truth (the scatter points are color-coded according to the class).
the other set of labels are the clusters we found using GMM. Basically I built a 50x50 grid of 2D points to cover the entire data domain, I then assign a cluster to each of those points by computing the posterior probability and choosing the component with highest likelihood. I showed those clusters in the background color. As a nice consequence, we get to see the discriminant decision boundaries between the clusters.
You can see that the cluster of points on the left got separated quite nicely (and perfectly matched the setosa
class). While the points on the right side of the plot got separated in two matching the other two classes, although there were instance "misclassified" if you will (some green points on the wrong side of the boundary).
Typically in a real setting you wouldn't have those actual classes to compare against, so no way to tell how "accurate" your clustering was (there exist other metrics for clustering performance evaluation)...
