I need to build a data frame of connections between datasets, where the strength of each connection is based on the words the datasets have in common. The data looks like this:
df1 <- c("test", "example", "random word", "another")
df2 <- c("word2", "word3", "test")
df3 <- c("word2", "test", "question", "stack", "overflow")
df4 <- c("word2", "no", "yes", "vector")
Ideally, I should get something like this:
links <- data.frame(
  source = c("df1", "df2", "df3", "df4"),
  target = c("df1", "df2", "df3", "df4"),
  value = c(1, 2, 2, 1)
)
The idea is to create a Sankey diagram, as explained here (https://www.r-graph-gallery.com/321-introduction-to-interactive-sankey-diagram-2.html), based on the similarity between datasets. However, I cannot figure out:
(1) How to calculate the similarity between word vectors across several datasets
(2) How to create a distance matrix based on this similarity, holding the distance for each pair of datasets
The problem is not so much how to calculate a distance, but how to do it for every pair of datasets (the example has only 4, but I have more than 70) and store the results in a single matrix.
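To make the goal concrete, here is a minimal sketch of the kind of result I am after. It assumes that similarity means the number of distinct words two datasets share, and it uses Jaccard distance as one possible distance. Both choices, and the names `datasets`, `shared`, `sim`, `union_size` and `jaccard_dist`, are assumptions for illustration only, not a fixed requirement.

```r
# Collect the word vectors in a named list so the same code works for 4 or 70+ datasets
datasets <- list(
  df1 = c("test", "example", "random word", "another"),
  df2 = c("word2", "word3", "test"),
  df3 = c("word2", "test", "question", "stack", "overflow"),
  df4 = c("word2", "no", "yes", "vector")
)

# Assumption: similarity = number of distinct words shared by two datasets
shared <- function(a, b) length(intersect(a, b))

# Symmetric similarity matrix with one row and one column per dataset
sim <- sapply(datasets, function(a) sapply(datasets, function(b) shared(a, b)))

# Assumption: Jaccard distance = 1 - |intersection| / |union|
union_size <- sapply(datasets, function(a) sapply(datasets, function(b) length(union(a, b))))
jaccard_dist <- 1 - sim / union_size

# One row per unordered pair of datasets, in the source/target/value layout
pairs <- t(combn(names(datasets), 2))
links <- data.frame(
  source = pairs[, 1],
  target = pairs[, 2],
  value = sim[pairs],  # a two-column character matrix indexes sim by row/column name
  stringsAsFactors = FALSE
)

# Drop pairs with no words in common, since they would not produce a link
links <- links[links$value > 0, ]
```

Keeping the vectors in a named list is what lets this scale: `sim` and `jaccard_dist` each hold every dataset pair in a single matrix, and `links` is derived from `sim` rather than typed by hand.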