Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of a flower from one of three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with the starting data and all of the intermediate dataframes (a quick way to eyeball each one is sketched just after this list):

- the input `iris` data
- `psm_scaled` (the spatial measurements, scaled to mean = 0, SD = 1)
- `cs` (the matrix of pairwise similarities)
- `cs.melt` (the pairwise similarities in long format)
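If you want to look at each of those once the code below has run, plain base R inspection helpers are enough; this is just a convenience sketch:

```r
# quick look at the starting data and each intermediate object
head(iris)         # the raw input
head(psm_scaled)   # the scaled spatial measurements, one row per flower
cs[1:3, 1:3]       # a corner of the 150 x 150 pairwise similarity matrix
head(cs.melt)      # the same similarities in long format
```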
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, while comparisons between individuals of different varieties have mean similarities approaching -1.
```r
library(lsa)
library(reshape2)

# keep only the four numeric measurement columns
temp <- iris[, 1:4]
# label each flower as variety.rownumber, e.g. "setosa.1"
iris.names <- paste0(iris$Species, '.', rownames(iris))
# scale to mean = 0, SD = 1
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names

cs <- lsa::cosine(t(psm_scaled))

# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("flower.A", "flower.B", "similarity")

# recover the variety from each flower label
class.A <-
  strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
  return(one.split[1])
})
class.B <-
  strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
  return(one.split[1])
})
cs.melt$comparison <-
  paste0(cs.melt$class.A, '_vs_', cs.melt$class.B)

# mean similarity for each variety-vs-variety comparison
cs.agg <-
  aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
```
which gives
```r
#                     Group.1          x
# 3       setosa_vs_virginica -0.7945321
# 7       virginica_vs_setosa -0.7945321
# 2      setosa_vs_versicolor -0.4868352
# 4      versicolor_vs_setosa -0.4868352
# 6   versicolor_vs_virginica  0.3774612
# 8   virginica_vs_versicolor  0.3774612
# 5  versicolor_vs_versicolor  0.4134413
# 9    virginica_vs_virginica  0.7622797
# 1          setosa_vs_setosa  0.8698189
```
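As the comment in the code notes, `cs` is symmetric, so `melt()` reports every pair twice. If that redundancy matters for your larger data set, one possible approach (just a sketch; `cs.upper` and `cs.melt.unique` are names I made up) is to blank out the lower triangle and the diagonal before melting:

```r
# keep each unordered pair exactly once: drop the lower triangle
# and the self-comparisons on the diagonal, then melt and drop NAs
cs.upper <- cs
cs.upper[lower.tri(cs.upper, diag = TRUE)] <- NA
cs.melt.unique <- na.omit(melt(cs.upper))
```

Note that if you aggregate this de-duplicated version, each cross-variety comparison appears under only one direction (e.g. `setosa_vs_virginica` but not `virginica_vs_setosa`).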
If you’re still not comfortable with performing `lsa::cosine()` on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia. Remembering that vectors A and B represent the ordered lists of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by the product of the magnitude of A and the magnitude of B.
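In symbols, the two equivalent formats are

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$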
The terse way of saying that in R is

```r
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
```

But we can rewrite that to look more similar to your post as

```r
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
```
I guess you could even re-write that (the calculation of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don't. R is highly optimized for operations on vectors and matrices; if you're new to R, don't second-guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you're down to tens of thousands.
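To see what that vectorization buys you, here is a minimal sketch of the fully matrix-based equivalent (`norms` and `cs.vectorized` are names I made up; it assumes `psm_scaled` and `cs` from above). A single matrix multiplication produces every pairwise dot product at once:

```r
# every pairwise dot product in one matrix multiplication,
# divided by the outer product of the row magnitudes
norms <- sqrt(rowSums(psm_scaled ^ 2))
cs.vectorized <- (psm_scaled %*% t(psm_scaled)) / outer(norms, norms)

# should match lsa::cosine(t(psm_scaled)) up to floating-point error
all.equal(cs.vectorized, cs)
```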
Anyway, let’s say that each individual only has two elements.
```r
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
```
So you can think of `individual.1` as a vector that runs from the origin (0, 0) to (1, 0), and `individual.2` as a vector that runs from the origin to (1, 1).
```r
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')

# plot the two points, label them, and draw each vector from the origin
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(some.data, rownames(some.data), adj = c(0, 0))
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
```
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.

```r
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }

cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
#      [,1]
# [1,]   45
```
Now we can use the `cosine.sim` function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, `psm_scaled` has already been defined as the scaled numerical values from the `iris` dataset.
```r
# outer loop over flower A, inner loop over flower B:
# one explicit cosine.sim() call per pair of flowers
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
  inner.loop.result <-
    lapply(rownames(psm_scaled), function(name.B) {
      individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
      individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
      similarity <- cosine.sim(individual.A, individual.B)
      return(list(name.A, name.B, similarity))
    })
  inner.loop.result <-
    do.call(rbind.data.frame, inner.loop.result)
  names(inner.loop.result) <-
    c('flower.A', 'flower.B', 'similarity')
  return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
```
Now we repeat the calculation of `cs.melt$class.A`, `cs.melt$class.B`, and `cs.melt$comparison` as above, and then calculate `cs.agg.from.loops` as the mean similarity between the various types of comparisons.
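Spelled out for completeness, the repeated steps are identical to the earlier block, just re-run on the loop-built `cs.melt`:

```r
# same class-extraction steps as before, on the loop-built cs.melt
class.A <- strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) one.split[1])
class.B <- strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) one.split[1])
cs.melt$comparison <- paste0(cs.melt$class.A, '_vs_', cs.melt$class.B)
```

And then the aggregation: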
```r
cs.agg.from.loops <-
  aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
```
```r
#                     Group.1          x
# 3       setosa_vs_virginica -0.7945321
# 7       virginica_vs_setosa -0.7945321
# 2      setosa_vs_versicolor -0.4868352
# 4      versicolor_vs_setosa -0.4868352
# 6   versicolor_vs_virginica  0.3774612
# 8   virginica_vs_versicolor  0.3774612
# 5  versicolor_vs_versicolor  0.4134413
# 9    virginica_vs_virginica  0.7622797
# 1          setosa_vs_setosa  0.8698189
```
Which, I believe, is identical to the result we got with `lsa::cosine`.
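If you'd rather confirm that programmatically than by eye, a quick check along these lines (assuming both aggregated data frames from above) should return `TRUE`:

```r
# compare the two aggregated results, ordered the same way;
# they should agree up to floating-point error
all.equal(cs.agg[order(cs.agg$Group.1), 'x'],
          cs.agg.from.loops[order(cs.agg.from.loops$Group.1), 'x'])
```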
So what I'm trying to say is... why wouldn't you use `lsa::cosine`?
Maybe you should be more concerned with

- selection of variables, including removal of highly correlated variables
- scaling/normalizing/standardizing the data
- performance with a large input data set (see the timing sketch below)
- identifying known similars and dissimilars for quality control

as previously addressed.
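On the performance point, here is a rough timing harness you could adapt. It is only a sketch (timings vary by machine, and `cs.fast`/`cs.slow` are names I made up) comparing the vectorized call against the equivalent nested loops:

```r
# time the single vectorized call
system.time(cs.fast <- lsa::cosine(t(psm_scaled)))

# time the equivalent explicit nested loops over the same rows
system.time({
  cs.slow <- sapply(rownames(psm_scaled), function(name.A) {
    sapply(rownames(psm_scaled), function(name.B) {
      cosine.sim(psm_scaled[name.A, ], psm_scaled[name.B, ])
    })
  })
})
```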