
I need a connection data frame that records the intensity of the connection between datasets of words. The data looks like this:

df1 <- c("test", "example", "random word", "another")
df2 <- c("word2", "word3", "test")
df3 <- c("word2", "test", "question", "stack", "overflow")
df4 <- c("word2", "no", "yes", "vector")
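With 70+ datasets it helps to gather the vectors into one named list instead of working with separate objects, so that all pairs can be generated programmatically. A minimal sketch, assuming the objects follow a `df1`, `df2`, … naming pattern (`mget()` looks them up by name):

```r
df1 <- c("test", "example", "random word", "another")
df2 <- c("word2", "word3", "test")
df3 <- c("word2", "test", "question", "stack", "overflow")
df4 <- c("word2", "no", "yes", "vector")

# Collect the objects into a single named list; the same call
# scales to any number of datasets, e.g. mget(paste0("df", 1:70))
vecs <- mget(paste0("df", 1:4))
```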

Ideally, I should get something like this:

links <- data.frame(
  source=c("df1","df2", "df3", "df4"), 
  target=c("df1","df2", "df3", "df4"), 
  value=c(1,2, 2, 1)
  )

The idea is to create a sankey diagram, as explained here (https://www.r-graph-gallery.com/321-introduction-to-interactive-sankey-diagram-2.html), based on the similarity between datasets. However, I cannot figure out:

(1) How to calculate the similarity between word vectors across several datasets
(2) How to create a distance matrix based on this similarity, with the result for each dataset pair

The problem is not so much how to calculate distances, but how to do it across many different datasets (the example only has 4, but I have more than 70) and store the results in a single matrix.
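One possible sketch of the pairwise step, assuming set overlap (Jaccard similarity) is an acceptable measure; `jaccard` is a helper defined here, not a base R function. All pairs are enumerated with `combn()`, the similarities are stored in a single symmetric matrix, and the shared-word counts are reshaped into a `links` data frame of the kind `sankeyNetwork()` expects:

```r
df1 <- c("test", "example", "random word", "another")
df2 <- c("word2", "word3", "test")
df3 <- c("word2", "test", "question", "stack", "overflow")
df4 <- c("word2", "no", "yes", "vector")

vecs <- mget(paste0("df", 1:4))  # scales to df1..df70 the same way
n <- length(vecs)

# Jaccard similarity between two word vectors: |intersection| / |union|
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Fill one symmetric similarity matrix over all dataset pairs
sim <- matrix(0, n, n, dimnames = list(names(vecs), names(vecs)))
for (i in seq_len(n)) for (j in seq_len(n))
  sim[i, j] <- jaccard(vecs[[i]], vecs[[j]])

# Reshape the unique pairs into a links data frame; here 'value'
# is the raw count of shared words, matching the desired output above
pairs <- t(combn(names(vecs), 2))
links <- data.frame(
  source = pairs[, 1],
  target = pairs[, 2],
  value  = apply(pairs, 1, function(p)
    length(intersect(vecs[[p[1]]], vecs[[p[2]]])))
)
```

`sim` can be converted to a distance matrix with `1 - sim` if a distance rather than a similarity is needed.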

  • Maybe [this](https://stackoverflow.com/questions/48311711/calculate-cosine-similarity-of-two-words-in-r) helps. – Andre Wildberg Nov 19 '21 at 18:00
  • Not sure what type of distance you look for. You can use function `adist` for generalized Levenshtein distance. You can consider functions in package `GrpString` to calculate the similarity between word vectors – bdedu Nov 19 '21 at 18:12
  • Thank you for your responses, I edited the question to make it more focused. The problem is not about "how" to calculate distances, but about how to do it with many different datasets and store the results in a single matrix. – ccfarre Nov 19 '21 at 18:31

0 Answers