
Say I have the following binary data frame, df.

structure(list(a = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0), b = c(0, 
0, 0, 0, 0, 0, 1, 0, 1, 0, 1), c = c(0, 0, 0, 0, 1, 0, 0, 1, 
0, 1, 0), d = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0), e = c(0, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 1), f = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 
0, 1), g = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0), h = c(1, 0, 0, 
0, 0, 0, 0, 1, 1, 0, 0), i = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 
0)), class = "data.frame", row.names = c(NA, -11L), .Names = c("a", 
"b", "c", "d", "e", "f", "g", "h", "i"))

> df
   a b c d e f g h i
1  0 0 0 1 0 0 0 1 0
2  0 0 0 0 0 0 1 0 0
3  0 0 0 0 1 1 1 0 0
4  0 0 0 0 0 0 1 0 0
5  1 0 1 0 0 0 0 0 0
6  0 0 0 0 0 0 0 0 1
7  0 1 0 1 0 0 0 0 0
8  1 0 1 0 0 0 0 1 0
9  0 1 0 0 0 0 0 1 1
10 0 0 1 1 0 0 0 0 1
11 0 1 0 0 1 1 0 0 0

I want to examine the similarities between rows, so I use an MDS plot. I perform classical multidimensional scaling on the binary (i.e., Jaccard) distances computed with dist().

# Load libraries
library(dplyr)
library(ggplot2)
library(magrittr)

# Perform MDS scaling using binary method
mds_df <- df %>% 
  dist(method = "binary") %>% 
  cmdscale 

Next, I label my columns, bind them to my original data frame, and add row numbers to be used as labels in my plot.

# Name columns
colnames(mds_df) <- c("mds_x", "mds_y")

# Bind to original data frame
df %<>% 
  cbind(mds_df) %>% 
  mutate(tags = row_number())

Finally, I plot my results with ggplot2.

g <- ggplot(df) + geom_point(aes(x = mds_x, y = mds_y), size = 5)
g <- g + geom_text(aes(x = mds_x, y = mds_y, label = tags), position = position_jitter(width = 0.05, height = 0.05))
g <- g + xlab("Coordinate 1") + ylab("Coordinate 2")
print(g)
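
As an aside, position_jitter() draws random offsets, so the label positions shift each time the plot is rendered; fixing the seed first keeps the figure reproducible (a minimal sketch, any seed will do):

set.seed(1)  # fixed seed keeps the jittered labels stable across renders
print(g)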

[MDS plot: points at (mds_x, mds_y) labelled 1-11, axes "Coordinate 1" and "Coordinate 2"]

Now, notice rows 2 and 4 in the matrix are exactly the same. In the figure they fall right on top of each other. Great! Makes sense. Next, look at rows 6 & 7. They have no common 1 values yet fall fairly close together. Hmm. To make things worse, rows 3 & 11 have two 1s in common yet are plotted much further apart. Weird.

I realise that the Jaccard approach compares the common elements with the total number of elements in both sets (i.e., intersection over union), but rows 6 & 7 share none of their three combined 1s, whereas rows 3 & 11 share two of their four combined 1s. Intuitively, I feel like 3 & 11 should fall closer together than 6 & 7. Is this due to a poor choice of distance metric or a flaw in my coding/logic? Is there another plotting method that would show these results in a more intuitive way?
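
As a sanity check, those Jaccard distances can be inspected directly (a quick sketch; it subsets to the original a-i columns because df now also contains the MDS coordinates and tags):

d <- as.matrix(dist(df[, letters[1:9]], method = "binary"))
d[6, 7]   # 1.0: no 1s shared out of 3 combined, the maximum distance
d[3, 11]  # 0.5: 2 of the 4 combined 1s are shared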

Dan

1 Answer


Since you have 9 variables, you are plotting 11 observations in 9-dimensional space. When you squash that down to 2 dimensions, details are lost. If you run cmdscale() with eig=TRUE, you will get more information about the final solution. The GOF value is the goodness of fit, with 1.0 being a perfect score. Yours is .52, so the 2-dimensional plot displays about 52% of the spatial information in the original 9 dimensions. That is good but not great. If you increase to 3 dimensions, you get a GOF of .68. Note that cmdscale() computes metric multidimensional scaling (aka principal coordinates analysis).
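
For illustration, a minimal sketch of that check (cmdscale() actually reports GOF as two values under slightly different normalisations):

d <- dist(df[, letters[1:9]], method = "binary")
fit2 <- cmdscale(d, k = 2, eig = TRUE)
fit2$GOF  # about .52: share of the spatial information kept in 2 dimensions
fit3 <- cmdscale(d, k = 3, eig = TRUE)
fit3$GOF  # about .68 in 3 dimensions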

With the vegan package you have the option of non-metric multidimensional scaling (NMDS) via monoMDS() or metaMDS(). The problem with NMDS is that the solution can land in a local minimum, so it is best to try several runs and pick the best one. That is what metaMDS() does: by default it tries 20 random starting configurations, and if two of them are essentially identical the solution is declared convergent. Your data does not produce two identical solutions, so I am just plotting the best of the 20. Using trymax=100, I finally got convergent solutions, but that solution is not noticeably different from the one using the default 20 tries:

library(vegan)

# Use only the original binary columns a-i, since df now also holds
# the MDS coordinates and tags added above
df.dst <- dist(df[, letters[1:9]], method = "binary")
df.meta <- metaMDS(df.dst)
plot(df.meta, "sites")
text(df.meta, "sites", pos = 3)

[NMDS plot of sites, points labelled 1-11]

I think the distances are a bit better represented in this plot. Certainly 11 and 3 are closer than 6 and 7.
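
The stress of the plotted solution is stored on the fitted object; refitting a few times shows how stable it is across random starts (smaller stress is better):

df.meta$stress                                   # stress of the best of the 20 tries
replicate(5, metaMDS(df.dst, trace = 0)$stress)  # spread across five more fits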

dcarlson
  • Thanks for your answer. Those are really interesting points. To clarify: what do you mean by "best" solution? Is there some metric that quantifies that? Or did you choose the best solution by looking at the plots? – Dan Feb 14 '18 at 17:54
  • "Best" means the smallest stress value. There is no way to guarantee that a stress value is the best since there is no deterministic way to solve the scaling problem. We can only repeat the process using different starting configurations many times and choose the smallest. For the run that I plotted, the stress value was .01505283. Five more runs produced stress values of .0150647, .01505594, .01508096, .01506501, and .1509079 so you can see they are all pretty close. Each run of `metaMDS(df.dst, trace=0)` will give you the stress value for the best of 20 tries. – dcarlson Feb 14 '18 at 18:13
  • I wanted to add that since we are fitting distances between points, rotation or reflection of the plot does not change the distances. As a result 2 plots can appear different but have the same stress value. – dcarlson Feb 14 '18 at 18:16
  • This is great. I went with `metaMDS(df, trymax = 1000, distance = "jaccard", binary = TRUE)` and it produced very intuitive results, like those you show. Thanks for your help! – Dan Feb 14 '18 at 19:21