I have an input file listing ~50,000 clusters and the factors present in each of them (~10 million entries in total). Here is a smaller example:
set.seed(1)
x = paste("cluster-",sample(c(1:100),500,replace=TRUE),sep="")
y = c(
paste("factor-",sample(c(letters[1:3]),300, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[1]),100, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[2]),50, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[3]),50, replace=TRUE),sep="")
)
data = data.frame(cluster=x,factor=y)
With a bit of help from another question, I managed to produce a pie chart of the co-occurrence of factors like this:
counts = with(data, table(tapply(factor, cluster, function(x) paste(as.character(sort(unique(x))), collapse='+'))))
pie(counts[counts>1])
Now I would like to draw a Venn diagram of the co-occurrence of factors instead. Ideally, it should also accept a threshold on the minimum count of each factor. For example, a factor would only count as present in a cluster if it occurs more than 10 times (n > 10) in that cluster.
I've tried to produce the counts table with aggregate(), but couldn't get it to work.
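To make the threshold idea concrete, here is a minimal sketch of the intermediate result I have in mind, built with table() rather than aggregate(). The names n_min, tab, present, sets and counts_thr are made up for illustration, and n_min is set to 1 only because the example data is too small for n > 10. The per-factor list of clusters in sets is, as far as I understand, the kind of input a Venn diagram function would need, but that part is an assumption.

```r
# Rebuild the example data exactly as above.
set.seed(1)
x = paste("cluster-",sample(c(1:100),500,replace=TRUE),sep="")
y = c(
paste("factor-",sample(c(letters[1:3]),300, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[1]),100, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[2]),50, replace=TRUE),sep=""),
paste("factor-",sample(c(letters[3]),50, replace=TRUE),sep="")
)
data = data.frame(cluster=x,factor=y)

# Hypothetical threshold: a factor must occur more than n_min times in a cluster.
n_min = 1

# Rows = clusters, columns = factors, cells = number of occurrences.
tab = table(data$cluster, data$factor)

# TRUE where a factor passes the threshold in a cluster.
present = unclass(tab) > n_min

# One set per factor: the clusters in which that factor passes the threshold.
sets = lapply(colnames(present), function(f) rownames(present)[present[, f]])
names(sets) = colnames(present)

# Thresholded version of the co-occurrence counts used for the pie chart.
combos = apply(present, 1, function(p) paste(colnames(present)[p], collapse="+"))
counts_thr = table(combos[combos != ""])
```

With n_min = 0 the combinations in counts_thr should match the unthresholded counts from the pie chart code, which is a quick way to check the sketch.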