
I have to calculate cosine similarity (a patient similarity metric) in R between 48k patients' data with some predictive variables. Here is the equation:

PSM(P1, P2) = P1 · P2 / (||P1|| ||P2||)
where P1 and P2 are the predictor vectors corresponding to two different patients: P1 is the index patient, P2 is the patient compared against the index (P1), and the pairwise patient similarity metric PSM(P1, P2) is calculated for that pair.

This process will go on for all 48k patients.

I have added a sample data set for 300 patients in a .csv file. Please find it here: https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a

  • please find the sample data-set for 300 patients in a .csv file. Please find the sample data-set here. https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a – Md Shakawath Hossain Jun 09 '17 at 14:50
  • Also, do you have some gold standard/known cases of "similar" and "different" patients? It's not required for the kind of metric you're requesting, but it's a nice reality check after the calculations are completed. – Mark Miller Jun 09 '17 at 15:46
  • If you have 10^6 patients, your similarity matrix will be about 10^6 x 10^6. How much memory do you have? – Hong Ooi Jun 09 '17 at 16:46

2 Answers


First things first: you can find more rigorous treatments of cosine similarity in other posts on this site, which are worth reading before committing to this approach.

Now, you clearly have a mixture of data types in your input, at least

  • decimal
  • integer
  • categorical

I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
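For example, a nominal column can be expanded into 0/1 indicator columns with model.matrix(); note the admission_type column and its levels below are invented for illustration, not taken from the posted csv:

```r
# Hypothetical data: a nominal admission-type column, invented for
# illustration (not from the posted csv).
adm <- data.frame(admission_type = factor(c("ELECTIVE", "EMERGENCY", "ELECTIVE")))
# model.matrix() one-hot encodes factors; the "- 1" drops the intercept
# so every level gets its own indicator column.
adm_numeric <- model.matrix(~ admission_type - 1, data = adm)
adm_numeric
```

Whether indicator columns are a sensible distance encoding for your categoricals is a modelling decision you'd still have to defend.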

Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:

diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min

I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variables. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like

diasbp_range <- diasbp_max - diasbp_min
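As a quick numeric companion to the pair plot, here is a sketch that flags strongly correlated column pairs with cor(). I use the built-in iris measurements as a stand-in for your blood-pressure columns, and the 0.9 cutoff is an arbitrary choice:

```r
# Flag strongly correlated column pairs before feeding them into the
# similarity calculation. iris is a stand-in for your numeric columns;
# the 0.9 threshold is arbitrary.
num_data <- iris[, 1:4]
cor_mat <- cor(num_data)
# upper triangle only, to skip self-pairs and duplicate pairs
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var.A = rownames(cor_mat)[high[, "row"]],
           var.B = colnames(cor_mat)[high[, "col"]],
           r     = cor_mat[high])
```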

Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function from the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!

After calculating the similarity, I try to find two different patients with the most dissimilar encounters.

Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.

Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall, the work grows quadratically as the number of rows n in your input increases: n rows means on the order of n² pairwise comparisons, and an n × n similarity matrix to hold them.
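To put a number on the memory concern raised in the comments, here's a back-of-the-envelope estimate for a dense similarity matrix of doubles (8 bytes per entry, assuming no sparsity tricks):

```r
# Memory for a dense n x n matrix of 8-byte doubles,
# before any computation even starts.
n <- 48000
bytes <- n ^ 2 * 8
gib <- bytes / 2 ^ 30
round(gib, 1)  # roughly 17 GiB for 48k patients
```

At 10^6 rows the same arithmetic gives several terabytes, which is why people reach for clusters or chunked/streaming approaches at that scale.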

library(lsa)
library(reshape2)

psm_sample <- read.csv("psm_sample.csv")

row.names(psm_sample) <-
  make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")

numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
                   (!(grepl(
                     pattern = "_id$", x = temp$variable
                   ))) &
                   (!(
                     grepl(pattern = "_code$", x = temp$variable)
                   )) &
                   (!(
                     grepl(pattern = "_type$", x = temp$variable)
                   ))) | temp$possible.type == "numeric"

psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)

psm_numerics$gender <- as.integer(psm_numerics$gender)

psm_scaled <- scale(psm_numerics)

pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
  pdf(
    file = "psm_pairs.pdf",
    width = 50,
    height = 50,
    paper = "special"
  )
pairs(pair.these.up)
dev.off()

#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))

# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")

extract.pat <- function(enc.col) {
  my.patients <-
    sapply(enc.col, function(one.pat) {
      temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
      return(temp[[1]][[2]])
    })
  return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)

same.pat <-      cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]

most.dissimilar <-
  different.pat[which.min(different.pat$similarity),]

dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
                                             as.character(most.dissimilar$enc.A) ,],
                              psm_numerics[rownames(psm_numerics) ==
                                             as.character(most.dissimilar$enc.B) ,])

print(t(dissimilar.pat.frame))

which gives

                  patid.68.49   patid.9
gender                1.00000   2.00000
age                  41.85000  41.79000
sysbp_min            72.00000 106.00000
sysbp_max            95.00000 217.00000
diasbp_min           42.00000  53.00000
diasbp_max           61.00000 107.00000
meanbp_min           52.00000  67.00000
meanbp_max           72.00000 132.00000
resprate_min         20.00000  14.00000
resprate_max         35.00000  19.00000
tempc_min            36.00000  35.50000
tempc_max            37.55555  37.88889
spo2_min             90.00000  95.00000
spo2_max            100.00000 100.00000
bicarbonate_min      22.00000  26.00000
bicarbonate_max      22.00000  30.00000
creatinine_min        2.50000   1.20000
creatinine_max        2.50000   1.40000
glucose_min          82.00000 129.00000
glucose_max          82.00000 178.00000
hematocrit_min       28.10000  37.40000
hematocrit_max       28.10000  45.20000
potassium_min         5.50000   2.80000
potassium_max         5.50000   3.00000
sodium_min          138.00000 136.00000
sodium_max          138.00000 140.00000
bun_min              28.00000  16.00000
bun_max              28.00000  17.00000
wbc_min               2.50000   7.50000
wbc_max               2.50000  13.70000
mingcs               15.00000  15.00000
gcsmotor              6.00000   5.00000
gcsverbal             5.00000   0.00000
gcseyes               4.00000   1.00000
endotrachflag         0.00000   1.00000
urineoutput        1674.00000 887.00000
vasopressor           0.00000   0.00000
vent                  0.00000   1.00000
los_hospital         19.09310   4.88130
los_icu               3.53680   5.32310
sofa                  3.00000   5.00000
saps                 17.00000  18.00000
posthospmort30day     1.00000   0.00000
Mark Miller
  • Thanks for your effort!! First of all I don't need dissimilar patients. I am very new in R. It would be nice if you write the formula in the question I mentioned, and to compare one row to another kindly write the two nested apply loops to get all comparisons. When I run your code and print the similar patients I can see only positive-valued similarities, but it can be negative also, and with my little knowledge I think it's not pair-wise calculation in contrast to the way I mentioned, like (P1, P2) where P1 is the index patient and P2 will be compared with index (P1) – Md Shakawath Hossain Jun 22 '17 at 09:30
  • Now some more clarification about PSM calculation and about your queries: 1. PSM is the cosine of an angle, it is normalized between -1 (minimum similarity when angle is 180 degree) and 1 (maximum similarity when angle is 0 degree). 2. Each continuous predictor was normalized to fit the range between -1 and 1 so that all predictors could equally contribute to the PSM. 3. For each categorical predictor product between two vectors should be assigned 1 if they have same category other wise -1, in an all-or-none fashion. – Md Shakawath Hossain Jun 22 '17 at 09:31
  • I have already invested a few hours on this question. If it isn't helpful at all, then I can understand that it hasn't been upvoted. **I'm even willing to put more time into it, but I'd like to see more of your work first.** You say you're new to R... what languages are you more proficient in? I would love to see **your solution in any language**. Maybe you should try with a simpler dataset first, like the **iris**es. – Mark Miller Jun 22 '17 at 10:56
  • I can assure you that this example does generate **pairwise** similarities... did you look at `cs`? I have not looked at `lsa::cosine` under the hood, but I am confident that it calculates the cosine similarity, just as you have described. If you can show me, in code, that the calculation is incorrect, please do. I reported **dissimilar** patients, even though you may not be interested in them, as a visually convincing indicator of algorithmic success, since you didn't provide a gold standard of patient similarity. – Mark Miller Jun 22 '17 at 10:56
  • As others have commented, pairwise operations over thousands or millions of rows requires attention to efficiency from the very beginning, and you should be very careful about **nested loops**. – Mark Miller Jun 22 '17 at 10:56

Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.

Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of flowers from three different varieties of iris plants.

Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.

Please familiarize yourself with starting data and all of the intermediate dataframes:

  1. The input iris data
  2. psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
  3. cs (the matrix of pairwise similarities)
  4. cs.melt (the pairwise similarities in long format)

At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, and comparisons between individuals of different varieties have mean similarities approaching negative 1.

library(lsa)
library(reshape2)

temp <- iris[, 1:4]

iris.names <- paste0(iris$Species, '.', rownames(iris))

psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names

cs <- lsa::cosine(t(psm_scaled))

# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("flower.A", "flower.B", "similarity")

class.A <-
  strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
  return(one.split[1])
})
class.B <-
  strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
  return(one.split[1])
})

cs.melt$comparison <-
  paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)

cs.agg <-
  aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)

print(cs.agg[order(cs.agg$x),])

which gives

#                    Group.1          x
# 3      setosa_vs_virginica -0.7945321
# 7      virginica_vs_setosa -0.7945321
# 2     setosa_vs_versicolor -0.4868352
# 4     versicolor_vs_setosa -0.4868352
# 6  versicolor_vs_virginica  0.3774612
# 8  virginica_vs_versicolor  0.3774612
# 5 versicolor_vs_versicolor  0.4134413
# 9   virginica_vs_virginica  0.7622797
# 1         setosa_vs_setosa  0.8698189

If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.

The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats on Wikipedia.

Remembering that vectors A and B represent the ordered list of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by the product of [the magnitude of A] and [the magnitude of B]

The terse way of saying that in R is

cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }  

But we can rewrite that to look more similar to your post as

cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }

I guess you could even re-write that (the calculation of similarity between a single pair of individuals) as a bunch of nested loops, but with a manageable amount of data, please don't. R is highly optimized for operations on vectors and matrices. If you're new to R, don't second-guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you're down to tens of thousands.
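In that spirit, the whole matrix of pairwise similarities can be computed with a single matrix product instead of loops. This is a sketch of the idea (cosine_matrix is my own name, not from any package), using the scaled iris measurements again:

```r
# A sketch: the full pairwise cosine matrix in one vectorized step.
# Normalizing each row to unit length makes the matrix of dot products
# equal to the matrix of cosine similarities.
cosine_matrix <- function(m) {
  m_unit <- m / sqrt(rowSums(m ^ 2))  # scale every row to length 1
  tcrossprod(m_unit)                  # same as m_unit %*% t(m_unit)
}

psm_scaled <- scale(iris[, 1:4])
cs_fast <- cosine_matrix(psm_scaled)
```

This should agree with lsa::cosine(t(psm_scaled)) up to floating-point error, and avoids the per-pair function-call overhead entirely.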

Anyway, let’s say that each individual only has two elements.

individual.1 <- c(1, 0)
individual.2 <- c(1, 1)

So you can think of individual.1 as a line that passes between the origin (0, 0) and (1, 0), and individual.2 as a line that passes between the origin and (1, 1).

some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')

plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
  some.data,
  rownames(some.data),
  xlim = c(-0.5, 2),
  ylim = c(-0.5, 2),
  adj = c(0, 0)
)

segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])

So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.

(Plot: the two individuals as labelled points, with segments from the origin showing the 45-degree angle between the vectors.)

cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi

print(angle.degrees)

#      [,1]
# [1,]   45

Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.

cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
  inner.loop.result <-
    lapply(rownames(psm_scaled), function(name.B) {
      individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
      individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
      similarity <- cosine.sim(individual.A, individual.B)
      return(list(name.A, name.B, similarity))
    })
  inner.loop.result <-
    do.call(rbind.data.frame, inner.loop.result)
  names(inner.loop.result) <-
    c('flower.A', 'flower.B', 'similarity')
  return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)

Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:

cs.agg.from.loops <-
  aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)

print(cs.agg.from.loops[order(cs.agg.from.loops$x),])

#                    Group.1          x
# 3      setosa_vs_virginica -0.7945321
# 7      virginica_vs_setosa -0.7945321
# 2     setosa_vs_versicolor -0.4868352
# 4     versicolor_vs_setosa -0.4868352
# 6  versicolor_vs_virginica  0.3774612
# 8  virginica_vs_versicolor  0.3774612
# 5 versicolor_vs_versicolor  0.4134413
# 9   virginica_vs_virginica  0.7622797
# 1         setosa_vs_setosa  0.8698189

Which, I believe, is identical to the result we got with lsa::cosine.

So what I'm trying to say is... why wouldn't you use lsa::cosine?

Maybe you should be more concerned with

  • selection of variables, including removal of highly correlated variables
  • scaling/normalizing/standardizing the data
  • performance with a large input data set
  • identifying known similars and dissimilars for quality control

as previously addressed

Mark Miller
  • First of all, I apologize for asking you again. There is some change in my requirements. For now the pairwise similarities are like patient (3 vs 3), (3.1 vs 3), (9 vs 9) and so on. I don't need pairwise calculation with itself like (3 vs 3) and so on … I need pairwise calculation row-wise, where each row will be compared with the rest of the other rows and then for the next iteration it will be excluded. – Md Shakawath Hossain Jun 29 '17 at 09:34
  • For example: pairwise calculation between patient (3 vs 3.1), (3 vs 9), (3 vs9.1) …. (3 vs last patient in the data frame) and then patient it will be excluded from the iteration. This process will go on for all the rows. Here is sample of output -- https://1drv.ms/w/s!AhoddsPPvdj3hVcH6uYE37uhVLvY – Md Shakawath Hossain Jun 29 '17 at 09:38
  • That's a reasonable request, but this post is already way too long. I recommend starting a new question and showing what you've done so far, and why it doesn't meet your requirements. Maybe you'll get some input from some other people in addition to me. You are most likely to get responses if you show exactly what your input data look like, exactly what your desired result looks like, and the code you've tried so far. It takes a little work, but that's the price you pay for free professional advice on Stack Overflow. – Mark Miller Jun 29 '17 at 13:10