I've got a long list of authors and words, something like:
author1,word1
author1,word2
author1,word3
author2,word2
author3,word1
The actual list has hundreds of authors and thousands of words. It lives in a CSV file, which I have read into a data frame and de-duplicated (rough loading code below):
> typeof(x)
[1] "list"
> colnames(x)
[1] "author" "word"
The tail end of the dput(head(x)) output looks like:
), class = "factor")), .Names = c("author", "word"), row.names = c(NA,
6L), class = "data.frame")
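For reference, the loading step was roughly this (the file name is just a placeholder for my real one; since the CSV has no header row, I name the columns while reading):

# read the author,word pairs; "authors_words.csv" stands in for my real file
x <- read.csv("authors_words.csv", header = FALSE,
              col.names = c("author", "word"))
# drop repeated author/word pairs
x <- unique(x)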
What I'm trying to do is calculate how similar the word lists are between authors: the size of the intersection of two authors' word lists, expressed as a percentage of one author's total vocabulary. For example, in the sample above, author1 and author2 share one word (word2), which is 100% of author2's one-word vocabulary but only about 33% of author1's three words, so the measure depends on which author you take as the baseline. (I'm sure there are proper terms for what I'm doing, but I don't quite know what they are.)
In Python or Perl I would group all the words by author and use nested loops to compare everyone with everyone else, but I'm wondering how I would do that in R (my literal-translation sketch is below). I have a feeling that "use apply" is going to be the answer; if it is, can you please explain it in small words for newbies like me?
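Here's that sketch, in case it helps show what I mean. It's untested and probably not good R, and it assumes the de-duplicated data frame is called x, as above:

# collect each author's words into one character vector per author
words_by_author <- split(as.character(x$word), x$author)
authors <- names(words_by_author)

# compare every author with every other author
for (a in authors) {
  for (b in authors) {
    if (a == b) next
    # words both authors use...
    shared <- length(intersect(words_by_author[[a]], words_by_author[[b]]))
    # ...as a percentage of author a's total vocabulary
    pct <- 100 * shared / length(words_by_author[[a]])
    cat(a, "vs", b, ":", round(pct, 1), "%\n")
  }
}

With hundreds of authors that's hundreds-squared comparisons, which is part of why I suspect there's a better R idiom than my loops.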