I've got a long list of authors and words, something like:
author1,word1
author1,word2
author1,word3
author2,word2
author3,word1
The actual list has hundreds of authors and thousands of words. It lives in a CSV file, which I have read into a data frame and de-duplicated (rough loading code below):
> typeof(x)
[1] "list"
> colnames(x)
[1] "author" "word"
The tail end of the dput(head(x)) output looks like:
), class = "factor")), .Names = c("author", "word"), row.names = c(NA,
6L), class = "data.frame")
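For reference, the loading step was roughly this (the file name is just a placeholder for my real one; since the CSV has no header row, I name the columns while reading):

# read the author,word pairs; "authors_words.csv" stands in for my real file
x <- read.csv("authors_words.csv", header = FALSE,
              col.names = c("author", "word"))
# drop repeated author/word pairs
x <- unique(x)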
What I'm trying to do is calculate how similar the word lists are between authors: the size of the intersection of two authors' word lists, expressed as a percentage of one author's total vocabulary. For example, in the sample above, author1 and author2 share one word (word2), which is 100% of author2's one-word vocabulary but only about 33% of author1's three words, so the measure depends on which author you take as the baseline. (I'm sure there are proper terms for what I'm doing, but I don't quite know what they are.)
In Python or Perl I would group all the words by author and use nested loops to compare everyone with everyone else, but I'm wondering how I would do that in R (my literal-translation sketch is below). I have a feeling that "use apply" is going to be the answer; if it is, can you please explain it in small words for newbies like me?
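Here's that sketch, in case it helps show what I mean. It's untested and probably not good R, and it assumes the de-duplicated data frame is called x, as above:

# collect each author's words into one character vector per author
words_by_author <- split(as.character(x$word), x$author)
authors <- names(words_by_author)

# compare every author with every other author
for (a in authors) {
  for (b in authors) {
    if (a == b) next
    # words both authors use...
    shared <- length(intersect(words_by_author[[a]], words_by_author[[b]]))
    # ...as a percentage of author a's total vocabulary
    pct <- 100 * shared / length(words_by_author[[a]])
    cat(a, "vs", b, ":", round(pct, 1), "%\n")
  }
}

With hundreds of authors that's hundreds-squared comparisons, which is part of why I suspect there's a better R idiom than my loops.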