R: Create a pairwise matrix of counts of shared values between items in one column and features in another

Question

I have a data thats's a bit like this:

items <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D")
features <- c("ab", "ac", "ad", "ab", "ab", "az", "ay", "az", "az", "al", "ab", "ad", "aa", "ac")
df <- data.frame(items, features)

Which gives a data frame like:

   items features
1      A       ab
2      A       ac
3      A       ad
4      A       ab
5      B       ab
6      B       az
7      B       ay
8      C       az
9      C       az
10     C       al
11     C       ab
12     C       ad
13     D       aa
14     D       ac

I would like to create two pairwise comparisons from the above data frame. The first is a comparison of every item with every other item that counts how many features are shared between them. The second is a comparison of every item with every other item that gives a character string of the shared features (separated by spaces, for example).

I have been able to do this using a loop with another loop within it to compare "A" with "B" and so on, and then "B" with all others, and so on, but this is a very slow process. The real data frame has ~2000 items in it and the compute time gets out of hand pretty quickly as the dataset grows.

The code I have used is like this:

item_list <- unique(df$items)
feature_count <- data.frame(matrix(ncol = length(item_list), nrow = length(item_list)))
colnames(feature_count) <- item_list
rownames(feature_count) <- item_list
feature_details <- data.frame(matrix(ncol = length(item_list), nrow = length(item_list)))
colnames(feature_details) <- item_list
rownames(feature_details) <- item_list
for (n in 1:length(item_list)){
  item <- df[df$item == item_list[n],]
  item_features <- as.list(item$features)
  for (z in 1:length(item_list)){
    comparison <- df[df$item == item_list[z],]
    comparison_features <- as.list(comparison$features)
    if (length(intersect(item_features, comparison_features)) == 0) {
      feature_count[z,n] <- length(intersect(item_features, comparison_features))
      feature_details[z,n] <- NA
    } else {
      feature_count[z,n] <- length(intersect(item_features, comparison_features))
      feature_details[z,n] <- paste(intersect(item_features, comparison_features), collapse = " ")
    }
  }
  diag(feature_count) <- 0
  diag(feature_details) <- NA
}

And returns two data frames like this:

feature_count

feature_details

      A     B     C    D
A  <NA>    ab ab ad   ac
B    ab  <NA> az ab <NA>
C ab ad ab az  <NA> <NA>
D    ac  <NA>  <NA> <NA>

The above seems like an inelegant and inefficient way of doing this. Could anyone offer any advice on a simpler approach to achieve the same thing that will make working with much, much larger datasets more doable?

This is a co-occurrence matrix with a zero-ed diagonal. There are simpler options here - https://stackoverflow.com/questions/19977596/how-do-i-calculate-the-co-occurrence-in-the-table/19977766 I believe. E.g., this code works adapting from there - `out <- tcrossprod(table(unique(dat))); diag(out) <- 0; out` . Other alternatives using a sparse matrix as per here - https://stackoverflow.com/questions/71506825/frequent-product-pair-given-transaction-in-r/71506933 - may adapt better to bigger data too. Or use the Rcpp solution from the first question. — thelatemail, Aug 11 '23 at 00:06

score 2 · Accepted Answer · answered Aug 10 '23 at 23:20

Here's a solution using outer(). First, make the data:

items <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D")
features <- c("ab", "ac", "ad", "ab", "ab", "az", "ay", "az", "az", "al", "ab", "ad", "aa", "ac")
df <- data.frame(items, features)

Split the data into groups baed on items, using split() will retain the group names as the names of the elements. Also save the names of the list elements as a vector called n_split.

items_spl <- split(df, items)
n_split <-names(items_spl)

The idea using outer() is to generate the outer product of n_split and n_split which will make a matrix that has the list names on the row and column. The function we use here isn't the product, but we are taking the length of the intersection of the unique features from each item's list. That gives the count matrix. Pasting together the intersection of unique features gives the shared feature matrix.

count <-outer(n_split,
      n_split, 
      Vectorize(
        function(x,y)length(intersect(unique(items_spl[[x]]$features), 
                                      unique(items_spl[[y]]$features)))))
diag(count) <-0

shared <-outer(n_split,
              n_split, 
              Vectorize(
                function(x,y)paste(intersect(unique(items_spl[[x]]$features), 
                             unique(items_spl[[y]]$features)), 
                   collapse=" ")))
diag(shared) <- ""
rownames(count) <- colnames(count) <- 
rownames(shared) <- colnames(shared) <- n_split

count
#>   A B C D
#> A 0 1 2 1
#> B 1 0 2 0
#> C 2 2 0 0
#> D 1 0 0 0

shared
#>   A       B       C       D   
#> A ""      "ab"    "ab ad" "ac"
#> B "ab"    ""      "ab az" ""  
#> C "ab ad" "az ab" ""      ""  
#> D "ac"    ""      ""      ""

^{Created on 2023-08-11 with reprex v2.0.2}

Thank you very much for this! I'll give this a try and let you know how I go. — Shaun Lehmann, Aug 11 '23 at 00:07
This worked perfectly and reduced the compute time HUGELY. Thank you. — Shaun Lehmann, Aug 13 '23 at 23:46
@ShaunLehmann Any time you can turn a loop into a matrix operation, that generally results in a huge efficiency gain. — DaveArmstrong, Aug 14 '23 at 00:38

R: Create a pairwise matrix of counts of shared values between items in one column and features in another

1 Answers1