I have a data frame which looks something like this:

id  val
1    a
1    b
2    a
2    c
2    d
3    a
3    a

Think of each row as a label, val, that was given to an observation identified by id.

What I ultimately want is a co-occurrence matrix like the one below, which counts how many times each letter appears within the same id as each other letter:

    a  b  c  d
a   1  1  1  1
b   1  0  0  0
c   1  0  0  1
d   1  0  1  0

I've been racking my brain for ways to do this, but have come up empty so far. Any hints? Preferably using tidyverse tools, but I'm open to other options at this point.

EDIT: the solutions to the question linked as a possible duplicate do not work in this case. I'm not sure why, but I suspect it has to do with that question having a data frame with 3 columns.

Jaap
Dave Kincaid
  • Possible duplicate of [Creating co-occurrence matrix](https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix) – missuse Sep 27 '17 at 19:07
  • That question is 5 years old, so I hope that there is a more straightforward solution now. I tried 2 or 3 of the solutions there and none of them work, so my question is different (maybe because it lacks a third column?) – Dave Kincaid Sep 27 '17 at 19:31
  • I've just noticed that my original solution is very similar to @d.b's, so I have changed it to add some value. – acylam Sep 28 '17 at 14:21
  • Don't forget to accept the best answer if it solves your problem:) – acylam Oct 02 '17 at 18:52

2 Answers

Here's a solution in base R. It's not very elegant, but it works on the example data:

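# One column per within-id pair of labels (2 rows, one column per pair)
temp = data.frame(do.call(cbind, lapply(split(df, df$id), function(a)
    combn(a$val, 2))), stringsAsFactors = FALSE)
# For each (row label, column label), count the pairs that match it in either order
sapply(sort(unique(df$val)), function(rows)
    sapply(sort(unique(df$val)), function(cols)
        sum(sapply(temp, function(x)
            identical(sort(x), sort(c(rows, cols)))))))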
#  a b c d
#a 1 1 1 1
#b 1 0 0 0
#c 1 0 0 1
#d 1 0 1 0
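
One caveat with the first step: combn(x, 2) throws an error when an id has only one row. A minimal sketch of a guard, assuming such ids should simply contribute no pairs (pairs_for_id is a hypothetical helper name, not part of the code above):

```r
# Example data from the question
df <- data.frame(id  = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
                 val = c("a", "b", "a", "c", "d", "a", "a"),
                 stringsAsFactors = FALSE)

# pairs_for_id() is a hypothetical helper used only for illustration.
# Assumption: an id with fewer than 2 rows contributes no pairs,
# so it returns an empty 2 x 0 matrix instead of calling combn().
pairs_for_id <- function(a) {
  if (nrow(a) < 2) return(matrix(character(0), nrow = 2))
  combn(a$val, 2)
}

# One column per within-id pair of labels
pairs <- do.call(cbind, lapply(split(df, df$id), pairs_for_id))
ncol(pairs)  # 1 + 3 + 1 = 5 pairs for the example data
```

The result can be wrapped in data.frame(..., stringsAsFactors = FALSE) exactly as temp is above.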

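Or with igraph:

library(igraph)
# Edge list with one row per within-id pair of labels (two columns)
temp = t(do.call(cbind, lapply(split(df, df$id), function(a) combn(a$val, 2))))
# graph_from_edgelist() reads each row as one edge; as_adjacency_matrix()
# replaces the deprecated get.adjacency()
as.matrix(as_adjacency_matrix(graph_from_edgelist(temp, directed = FALSE)))
# Output as originally posted (vertices are listed in order of first appearance,
# so the row/column order may differ):
#  a c b d
#a 1 1 1 1
#c 1 0 0 1
#b 1 0 0 0
#d 1 1 0 0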

DATA

df = structure(list(id = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
                    val = c("a", "b", "a", "c", "d", "a", "a")),
               .Names = c("id", "val"),
               class = "data.frame",
               row.names = c(NA, -7L))
d.b

A solution with dplyr + purrr:

library(dplyr)
library(purrr)
df %>%
  split(.$id) %>%
  map_dfr(function(x){
    t(combn(x$val, 2)) %>% 
      data.frame(stringsAsFactors = FALSE)
  }) %>%
  # across() replaces the deprecated mutate_all(funs(...)); needs dplyr >= 1.0
  mutate(across(everything(), ~ factor(.x, levels = c("a", "b", "c", "d")))) %>%
  table() %>%
  pmax(., t(.))

Result:

   X2
X1  a b c d
  a 1 1 1 1
  b 1 0 0 0
  c 1 0 0 1
  d 1 0 1 0

Notes:

  1. I first split the df by id, then used map_dfr from purrr to map the combn function over each id group.
  2. combn(x, 2) finds all 2-element combinations of a vector (length(x) choose 2 of them) and returns a matrix with one combination per column.
  3. The _dfr suffix of map_dfr means that the results are row-bound into a single data frame, so this is effectively do.call(rbind, lapply()).
  4. Converting both columns to factors with explicit levels makes sure that table retains all the levels needed, even if a letter does not appear in a column.
  5. After the table step the result is an upper triangular matrix, so I fed that matrix and its transpose into pmax.
  6. pmax finds the parallel (element-wise) maxima of the two inputs and returns a symmetric matrix as desired.
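
Steps 2, 5 and 6 can be seen in isolation with a small sketch. The matrix upper below is hand-written for illustration, not output from the pipeline:

```r
# combn() returns one column per unordered pair, in input order
combn(c("a", "c", "d"), 2)
#      [,1] [,2] [,3]
# [1,] "a"  "a"  "c"
# [2,] "c"  "d"  "d"

# A hand-written upper-triangular count matrix, the shape table() gives
# when the first element of every pair sorts before the second
upper <- matrix(c(1, 1, 1,
                  0, 0, 1,
                  0, 0, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("a", "c", "d"), c("a", "c", "d")))

# pmax() takes the element-wise maximum, mirroring the upper triangle
sym <- pmax(upper, t(upper))
isSymmetric(sym)  # TRUE
```

Note that pmax keeps the larger of the two counts rather than their sum. If the same pair can show up in both orders (for example, val is not sorted within each id), sorting val within each id before calling combn keeps every pair in the upper triangle.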

Data:

df = read.table(text=  "id  val
                1    a
                1    b
                2    a
                2    c
                2    d
                3    a
                3    a", header = TRUE, stringsAsFactors = FALSE)
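
For comparison, here is a different technique from the pipeline above: a base R sketch built on crossprod of the id-by-val contingency table. It assumes the same counting convention as combn, namely that a label occurring n_u times and another occurring n_v times within one id contribute n_u * n_v pairs, and that a label occurring n times contributes n * (n - 1) / 2 pairs with itself.

```r
# Example data from the question
df <- data.frame(id  = c(1L, 1L, 2L, 2L, 2L, 3L, 3L),
                 val = c("a", "b", "a", "c", "d", "a", "a"),
                 stringsAsFactors = FALSE)

# id x val counts: m[i, v] = number of times label v occurs for id i
m <- unclass(table(df$id, df$val))

# Off-diagonal entries: sum over ids of n_u * n_v
co <- crossprod(m)

# Diagonal entries: pairs of the same label within an id, n * (n - 1) / 2
diag(co) <- colSums(m * (m - 1) / 2)
co
```

On the example data this gives the same counts as the matrix in the question, and it does not fail on ids with a single row.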
acylam