
I am hoping to efficiently compute a co-occurrence matrix by finding the co-occurrences between two different variables within a group, ideally without using a complex loop that iterates through all possible combinations.

My data frame looks as follows:

df = data.frame(group = c(1,1,1,2,2,2),var1 = c(1,2,4,2,2,4),var2 = c(4,1,2,1,3,2))

> df
  group var1 var2
1     1    1    4
2     1    2    1
3     1    4    2
4     2    2    1
5     2    2    3
6     2    4    2

I am hoping to turn this into a co-occurrence matrix, where the rows represent var1 and the columns represent var2.

EDIT: For those unfamiliar with co-occurrences, I am interested in pairs of values that occur together within a group. For example, the combination of "2" and "1" happens once in group 1 and another time in group 2, implying 2 co-occurrences. In my example I put the combinations next to each other, but they could occur anywhere within the group.

It should look like the following:

> cooc
  1 2 3 4
1 0 2 0 1
2 2 0 1 2
3 0 1 0 0
4 1 2 0 0
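
To make the counting concrete, here are the row-wise value pairs behind those counts (just pasting the two columns together as an illustration, not the solution I am after):

> paste(df$var1, df$var2, sep = "_")
[1] "1_4" "2_1" "4_2" "2_1" "2_3" "4_2"

so "2_1" occurs twice in total, which gives the 2s at positions [2,1] and [1,2] above, while "1_4" occurs once, giving the 1s at [1,4] and [4,1].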

I have done this before when dealing with co-occurrences of just one variable within a group by using the xtabs function, but I am not sure how to apply it to multiple columns. For example, if I were interested in finding the co-occurrences for var1 within the different groups, I would do the following:

> td = xtabs(~group + var1,data = df)
> cooc = crossprod(td,td)
> diag(cooc) = 0
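
For the example data, that single-variable version should give something along these lines (ignoring the var1 dimname labels):

> cooc
  1 2 4
1 0 1 1
2 1 0 3
4 1 3 0

where, for instance, the 3 at [2,4] comes from 1*1 in group 1 plus 2*1 in group 2.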
Boudewijn Aasman
  • Maybe explain what exactly a co-occurrence is and how to derive the result for users who may not know that term – road_to_quantdom Dec 03 '15 at 19:25
  • Good point. I edited my post to explain exactly what I meant by the term "co-occurrence". – Boudewijn Aasman Dec 03 '15 at 19:33
  • What is the significance of `group`? I am not fully understanding what needs to be done. Won't the result be the same regardless of the grouping? If not, could you give an example where it would depend on the grouping. – road_to_quantdom Dec 03 '15 at 20:25

1 Answer


If I am understanding your question correctly, I believe this should work:

# I only use data.table here in case we need to do this "by group",
# but in this solution I do not use it, as I did not see the significance
# of grouping
### library(data.table)
### df <- data.table(df)

# this creates the pair of values "a_b"
df$ID <- paste(df$var1,df$var2,sep="_")
# enumerate all the unique values so that we can build
# a map ("grid") of every possible pair to match the data against
uniqval <- sort(unique(c(df$var1,df$var2)))
grid <- expand.grid(uniqval,uniqval)
grid$ID <- paste(grid$Var1,grid$Var2,sep="_")
# match our data to this map
matches <- sort(match(df$ID,grid$ID))
# tabulate our results into a data frame
tab <- data.frame(table(grid$ID[matches]))
# split up our ID back into values
tab$Var2 <- substr(tab$Var1,3,3)
tab$Var1 <- substr(tab$Var1,1,1)
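# note: using substr() like this assumes the var1/var2 values are single characters;
# if they could be longer, splitting the ID on "_" (e.g. with strsplit) would be safer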
# create our empty result matrix
cooc <- matrix(0,nrow=length(uniqval),ncol=length(uniqval))
rownames(cooc) <- uniqval
colnames(cooc) <- uniqval

# there are other ways to do this,
# but this seemed like a simple enough loop to me.
# we just need to place the tabulation results
# into the desired locations in the matrix,
# namely, "a_b" frequencies into the [a,b] and [b,a] positions
for(m in 1:nrow(tab)){

  i <- tab$Var1[m]
  j <- tab$Var2[m]

  # by adding to the previous value,
  # we account for "a_b" being equivalent to "b_a"
  cooc[i,j] <- cooc[i,j]+tab$Freq[m]
  cooc[j,i] <- cooc[i,j]

}
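
Running those steps on the example data frame from the question should then reproduce the matrix you posted:

> cooc
  1 2 3 4
1 0 2 0 1
2 2 0 1 2
3 0 1 0 0
4 1 2 0 0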
road_to_quantdom
  • This worked, thank you very much. I had to split up the expand.grid step with my actual dataset due to running into memory problems. – Boudewijn Aasman Dec 04 '15 at 19:16
  • Oh yeah, expand.grid can quickly grow out of proportion if the number of variables is large. I have 32GB of RAM so I didn't even consider that possibility. Glad it helps. – road_to_quantdom Dec 05 '15 at 03:41