Retrieving top k similar rows in a matrix for each row via cosine similarity in R

Question

The question "How to efficiently retrieve top K-similar vectors by cosine similarity using R?" asks how to find the most similar vectors for each vector of one matrix, relative to another matrix. It has a satisfactory answer, and I'd like to adapt it to operate on a single matrix.

That is, I'd like the top k most similar other rows for each row of a matrix. I suspect the solution is very similar, but that it can be optimized.

score -1 · Accepted Answer · answered Sep 26 '15 at 00:57

This function is based on the linked answer:

CosineSimilarities <- function(m, top.k) {
  # Computes cosine similarity between each row and all other rows in a matrix.
  #
  # Args:
  #   m: Matrix of values.
  #   top.k: Number of top rows to show for each row.
  #
  # Returns:
  #   Data frame with columns for pair of rows, and cosine similarity, for top
  #   `top.k` rows per row.
  #   
  # Similarity computation
  cp <- tcrossprod(m)
  mm <- rowSums(m ^ 2)
  result <- cp / sqrt(outer(mm, mm))
  # Indices of the most similar rows, one column of `top` per row of `m`
  # Take `top.k + 1` rows because each row matches itself (similarity = 1)
  top <- apply(result, 2, order, decreasing=TRUE)[seq(top.k + 1), ]
  result.df <- data.frame(row.id1=c(col(top)), row.id2=c(top))
  result.df$cosine.similarity <- result[as.matrix(result.df[, 2:1])]
  # Remove same-row records and return
  return(result.df[result.df$row.id1 != result.df$row.id2, ])
}

For example:

(m <- matrix(1:9, nrow=3))
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9
CosineSimilarities(m, 1)
#   row.id1 row.id2 cosine.similarity
# 2       1       2            0.9956
# 4       2       3            0.9977
# 6       3       2            0.9977
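
One caveat: the function above relies on each row's self-match landing among its top `top.k + 1` entries. With several identical rows that is not guaranteed, and a row can then come back with more than `top.k` matches. A minimal sketch of a variant that masks the diagonal before ranking is shown below; the function name `CosineSimilaritiesNoSelf` and the masking step are my own assumptions, not part of the answer above, and the sketch assumes `m` has no all-zero rows.

```r
# Hypothetical variant of CosineSimilarities(); the name and the diagonal
# masking are illustrative assumptions, not part of the original answer.
CosineSimilaritiesNoSelf <- function(m, top.k) {
  stopifnot(is.matrix(m), top.k >= 1, top.k < nrow(m))
  # Pairwise cosine similarity between rows (assumes no all-zero rows)
  norms <- sqrt(rowSums(m ^ 2))
  sim <- tcrossprod(m) / outer(norms, norms)
  # Mask the diagonal so a row can never be picked as its own neighbour
  diag(sim) <- -Inf
  # Indices of the top.k most similar rows, one column per row of m
  top <- apply(sim, 2, function(s) order(s, decreasing = TRUE)[seq_len(top.k)])
  top <- matrix(top, nrow = top.k)  # keep matrix shape when top.k == 1
  result.df <- data.frame(row.id1 = c(col(top)), row.id2 = c(top))
  # Look up the similarity of each (row.id2, row.id1) pair
  result.df$cosine.similarity <-
    sim[cbind(result.df$row.id2, result.df$row.id1)]
  result.df
}

m <- matrix(1:9, nrow = 3)
res <- CosineSimilaritiesNoSelf(m, 1)

# Three identical rows: still exactly one match per row, never the row itself
dup <- CosineSimilaritiesNoSelf(matrix(1, nrow = 3, ncol = 2), 1)
```

On the 3 x 3 example this should select the same pairs as before (row 1 with 2, row 2 with 3, row 3 with 2). Because the diagonal is masked up front, no filtering step is needed afterwards and every row gets exactly `top.k` matches.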
