Say I have the following binary data frame, df
.
structure(list(a = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0), b = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 1), c = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 1, 0), d = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0), e = c(0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 1), f = c(0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 1), g = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0), h = c(1, 0, 0,
0, 0, 0, 0, 1, 1, 0, 0), i = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
0)), class = "data.frame", row.names = c(NA, -11L), .Names = c("a",
"b", "c", "d", "e", "f", "g", "h", "i"))
> df
a b c d e f g h i
1 0 0 0 1 0 0 0 1 0
2 0 0 0 0 0 0 1 0 0
3 0 0 0 0 1 1 1 0 0
4 0 0 0 0 0 0 1 0 0
5 1 0 1 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 1
7 0 1 0 1 0 0 0 0 0
8 1 0 1 0 0 0 0 1 0
9 0 1 0 0 0 0 0 1 1
10 0 0 1 1 0 0 0 0 1
11 0 1 0 0 1 1 0 0 0
I want to examine the similarities between rows and, therefore, use an MDS plot. I perform a classical MDS scaling using the binary
(i.e., Jaccard) method in dist
.
# Load libraries
library(dplyr)
library(ggplot2)
library(magrittr)
# Perform MDS scaling using binary method
mds_df <- df %>%
dist(method = "binary") %>%
cmdscale
Next, I label my columns, bind them to my original data frame, and add row numbers to be used as labels in my plot.
# Name columns
colnames(mds_df) <- c("mds_x", "mds_y")
# Bind to original data frame
df %<>%
cbind(mds_df) %>%
mutate(tags = row_number())
Finally, I plot my results with ggplot2
.
g <- ggplot(df) + geom_point(aes(x = mds_x, y = mds_y), size = 5)
g <- g + geom_text(aes(x = mds_x, y = mds_y, label = tags), position = position_jitter(width = 0.05, height = 0.05))
g <- g + xlab("Coordinate 1") + ylab("Coordinate 2")
print(g)
Now, notice rows 2 and 4 in the matrix are exactly the same. In the figure they fall right on top of each other. Great! Makes sense. Next, look at rows 6 & 7. They have no common 1
values yet fall fairly close together. Hmm. To make things worse, rows 3 & 11 have two 1
s in common yet are plotted much further apart. Weird.
I realise that the Jaccard approach compares those common elements with the total number of elements in both sets (i.e., intersect over union), but rows 6 & 7 have three elements not in common and none in common whereas 3 & 11 have two elements in common and two not in common. Intuitively, I feel like 3 & 11 should fall closer together than 6 & 7. Is this due to a poor choice of distance metric or a flaw in my coding/logic? Is there another plotting method that would show these results in a more intuitive way?