How do I count the number of times any two given values occur together in a row in R?

Question

I am working with a data frame like this, with the ID column indicating a specific publication:

ID AuthorA AuthorB AuthorC
1   Chris   Lee     Jill
2   Jill    Tom     Lee 
3   Tom     Chris   Lee
4   Lee     Jill    NA
5   Jill    Chris   NA

I would like to generate a source, target, and count column for a social network analysis. In other words, count the number of times two authors appear on the same publication. The data frame I am working with, however, has 18 author columns. This should be the final output:

Source Target Count
Chris   Lee     2
Chris   Jill    2
Lee     Jill    3
Jill    Tom     1
Tom     Lee     2
Tom     Chris   1

Ronak Shah · Accepted Answer · 2020-12-09T08:24:22.297

3

For every row you can create all combination of names and count their frequency with table.

result <- stack(table(unlist(apply(df[-1], 1, function(x) {
                 vec <- na.omit(x)
                 if(length(vec) < 2) return(NULL)
                  combn(vec, 2, function(y) paste0(sort(y), collapse = '-'))
            }))))[2:1]
result
#         ind values
#1 Chris-Jill      2
#2  Chris-Lee      2
#3  Chris-Tom      1
#4   Jill-Lee      3
#5   Jill-Tom      1
#6    Lee-Tom      2

To get them in separate columns you can use separate :

tidyr::separate(result, ind, c('Source', 'Target'), sep = '-')

#  Source Target values
#2  Chris   Jill      2
#3  Chris    Lee      2
#4  Chris    Tom      1
#6   Jill    Lee      3
#7   Jill    Tom      1
#9    Lee    Tom      2

edited Dec 09 '20 at 08:24

answered Dec 08 '20 at 12:08

Ronak Shah

377,200
20
156
213

Hmmmm. if you used `collapse = "\t"` would `table` understand the desired column assignments? – Carl Witthoft Dec 08 '20 at 18:02
Sorry, I didn't get what you mean. Do you mean to use `collapse = "\t"` in `paste` and then have data in two columns using `read.table` ? – Ronak Shah Dec 09 '20 at 01:24
Thank you so much for taking the time to respond. I tried running this and was presented with this error: `Error in combn(na.omit(x), 2, function(y) paste0(sort(y), collapse = "-")) : n < m` – Frowin Becker Dec 09 '20 at 08:17
That did it! I cannot thank you enough. Really appreciate it. – Frowin Becker Dec 09 '20 at 09:08
Yes, I was too lazy to try myself :-( , but I did mean inserting a tab instead of the hyphen in the `paste` operation. – Carl Witthoft Dec 09 '20 at 12:43
@CarlWitthoft, I don't think `table` or `stack` are that clever :-) – A5C1D2H2I1M1N2O1R2T1 Dec 17 '20 at 06:23
@A5C1D2H2I1M1N2O1R2T1 nope! So I wrote a quick `strsplit` to create the separate name-columns. – Carl Witthoft Dec 17 '20 at 17:12

score 1 · Answer 2 · answered Dec 14 '20 at 05:04

Here's another approach in base R. It involves using crossprod which gets the frequencies you're after, just not in the format you're after.

Here's an example of what the crossprod output looks like:

crossprod(table(cbind(1:nrow(m), stack(lapply(m[, -1], as.character)))[, 1:2]))
#        values
# values  Chris Jill Lee Tom
#   Chris     3    2   2   1
#   Jill      2    4   3   1
#   Lee       2    3   4   2
#   Tom       1    1   2   2

Looking at the above, you can see that the values you're interested in are in the upper or lower triangle of the resulting matrix. You can use this information to write a function like the following to get your tabulations by pairs of values:

pair_table <- function(data) {
  m <- cbind(1:nrow(data), stack(lapply(data, as.character)))
  count <- crossprod(table(m[, 1:2]))
  count[upper.tri(count, diag = TRUE)] <- NA
  na.omit(as.data.frame(as.table(count)))
}

Using the function:

pair_table(m[, -1])
#    values values.1 Freq
# 2    Jill    Chris    2
# 3     Lee    Chris    2
# 4     Tom    Chris    1
# 7     Lee     Jill    3
# 8     Tom     Jill    1
# 12    Tom      Lee    2

Sample data used in this answer.

 m <- structure(list(ID = 1:5, AuthorA = structure(c(1L, 2L, 4L, 3L,                       
     2L), .Label = c("Chris", "Jill", "Lee", "Tom"), class = "factor"),                    
         AuthorB = structure(c(3L, 4L, 1L, 2L, 1L), .Label = c("Chris",                    
         "Jill", "Lee", "Tom"), class = "factor"), AuthorC = structure(c(1L,               
         2L, 2L, NA, NA), .Label = c("Jill", "Lee"), class = "factor")), row.names = c(NA, 
     5L), class = "data.frame")

How do I count the number of times any two given values occur together in a row in R?

2 Answers2