
Having gotten my data into the format:

    pId   fId   
1   1     0     
2   1     108   
3   1     940   
4   1     972   
5   1     993   
6   2     0     
7   3     0     
8   3     32    
9   3     108   
10  3     176

My goal (for a much longer set of data) is to determine which fIds each pId has in common with every other pId, and from that how many they have in common. My plan was to summarise the data into a single row per pId, where the fId entry is a list of that pId's fIds, and then loop a function like intersect() across those lists, for an ideal output of the format:

   pId1   pId2  together
1   1     2     1
2   1     3     2
3   1     4     N
4   2     3     1

etc....
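As a quick sketch of what I mean by "together" (assuming the raw data above is in a data frame called df_a, as in my code below), intersect() on the fId vectors for pIds 1 and 3 would give:

intersect(df_a$fId[df_a$pId == 1], df_a$fId[df_a$pId == 3])
#> [1]   0 108
length(intersect(df_a$fId[df_a$pId == 1], df_a$fId[df_a$pId == 3]))
#> [1] 2  # the "together" value for the pair (1, 3)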

EDIT: I'm trying to work with the data in one of these forms:

   pId  allfId                          allfIdSplit
1   1   0,108,940,972,993               c("0", "108", "940", "972", "993")
2   2   0                               0
3   3   0,32,108,176                    c("0", "32", "108", "176")
4   4   0,200,561,602,629,772,825,991   c("0", "200", "561", "602", "629", "772", "825", "991")
5   5   0                               0

Here is the code I have so far, where df_a is the raw data shown at the top, producing the output shown in the edit:

library(dplyr)
library(stringr)

df_c <- df_a %>%
  group_by(pId) %>%
  arrange(pId) %>%
  summarize(allfId = paste(unique(fId), collapse = ",")) %>%  # one comma-separated string per pId
  mutate(allfIdSplit = str_split(allfId, ",")) %>%            # split into a list column
  print()
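From that list column, the overlap for any two pIds can then be read off with intersect(); a quick sketch against the summarised table above (note that after str_split the fIds are character strings):

# overlap between pId 1 and pId 3 via the list column
length(intersect(df_c$allfIdSplit[[1]], df_c$allfIdSplit[[3]]))
#> [1] 2  # "0" and "108" are shared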
Alex
  • Welcome to StackOverflow! Your question generally is about associations, and specifically about "pairwise counts" and "pairwise correlations". There are already quite a few different [answers on SO](https://stackoverflow.com/search?q=%5Br%5D+pairwise+counts) about pairwise counts. Do they help answer your question? – ravic_ Nov 24 '19 at 17:08
  • Thanks, I've not heard of them, but I'll start by giving those a look! – Alex Nov 24 '19 at 17:12
  • You can do a self-join by fId – camille Nov 24 '19 at 17:14
  • The base function to consider would be `tapply` – IRTFM Nov 24 '19 at 17:18
  • Searching the R tag for "self join", I'm mostly finding `data.table` solutions. Since we don't have the code you're using, I don't know if that works for you, but take a look [here](https://stackoverflow.com/q/14983805/5325862), [here](https://stackoverflow.com/q/15776064/5325862), [here](https://stackoverflow.com/q/58049297/5325862) for starters – camille Nov 24 '19 at 17:21
  • I've added a bit of the code as well, but will check those out – Alex Nov 24 '19 at 17:34

1 Answer


Here's one way to do it in the tidyverse. See comments in the code.

library(tidyverse)
library(magrittr)

df.counts <- combn(unique(df$pId), 2) %>% # unique pairs of pIds (2 x n.pairs matrix)
  t %>% # transpose so each pair is a row
  as.data.frame() %>% # convert to a data frame
  set_colnames(c('pId1', 'pId2')) %>%  # name the columns
  left_join(df, by = c(pId1 = 'pId')) %>% # join the original data to pId1
  left_join(df, by = c(pId2 = 'pId')) %>% # join the original data to pId2
  filter(fId.x == fId.y) %>% # keep only rows where the two pIds share an fId
  count(pId1, pId2) # count shared fIds per pair

  pId1  pId2     n
  <int> <int> <int>
1     1     2     1
2     1     3     2
3     2     3     1
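Note that on your real data, pairs sharing no fIds will drop out of the result entirely (the filter removes all of their rows). If you want those pairs kept with a count of 0, one option is to join the counts back onto the full pair list; a sketch reusing the pipeline above (df.counts.full is just an illustrative name, and replace_na comes from tidyr, loaded with the tidyverse):

df.counts.full <- combn(unique(df$pId), 2) %>%
  t %>%
  as.data.frame() %>%
  set_colnames(c('pId1', 'pId2')) %>%
  left_join(df.counts, by = c('pId1', 'pId2')) %>%
  mutate(n = replace_na(n, 0L)) # zero-overlap pairs get n = 0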

An alternative using loops

Loops are usually not the best way to handle these types of problems in R, but since operations like combn seem to be too expensive on your real data, this may be more performant.

pids <- unique(df$pId)

result <- list()

# visit each unordered pair of pIds exactly once
for (i in seq_len(length(pids) - 1)) {
  for (j in seq(i + 1, length(pids))) {
    x <- pids[i]
    y <- pids[j]
    x.vals <- df$fId[df$pId == x]
    y.vals <- df$fId[df$pId == y]
    together <- length(intersect(x.vals, y.vals))
    result[[length(result) + 1]] <- data.frame(pId1 = x, pId2 = y, together = together)
  }
}

df.new <- do.call(rbind, result)

  pId1 pId2 together
1    1    2        1
2    1    3        2
3    2    3        1

And here is a version that preallocates the size of the final data frame, which may be even more performant:

pids <- unique(df$pId)
n.pairs <- length(pids) * (length(pids) - 1) / 2  # one row per unordered pair
result <- data.frame(pId1 = rep(NA, n.pairs), pId2 = NA, together = NA)
row.num <- 1
for (i in seq_len(length(pids) - 1)) {
  for (j in seq(i + 1, length(pids))) {
    x <- pids[i]
    y <- pids[j]
    x.vals <- df$fId[df$pId == x]
    y.vals <- df$fId[df$pId == y]
    result[row.num, 'pId1'] <- x
    result[row.num, 'pId2'] <- y
    result[row.num, 'together'] <- length(intersect(x.vals, y.vals))
    row.num <- row.num + 1
  }
}
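For completeness, here's a base R sketch of the self-join idea mentioned in the question comments (join the table to itself on fId, keep each unordered pair once, then count matches; this assumes no duplicated (pId, fId) rows):

pairs <- merge(df, df, by = 'fId')            # all row pairs that share an fId
pairs <- pairs[pairs$pId.x < pairs$pId.y, ]   # keep each unordered pair once
df.join <- aggregate(fId ~ pId.x + pId.y, data = pairs, FUN = length) # count shared fIds
names(df.join) <- c('pId1', 'pId2', 'together')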
jdobres
  • Seems to be having issues with `set_colnames`, and `combn(unique(df_q4_a$passengerId), 2)` seems to be computing endlessly. Maybe the dataset is too large – Alex Nov 24 '19 at 17:30
  • `combn` could get very expensive if you have a huge number of unique pIds. See my loop-based alternative. – jdobres Nov 24 '19 at 17:47
  • I'll give it a go when my pc is done rebooting, combn might have caused it to crash and update! – Alex Nov 24 '19 at 17:53
  • You could use `Rfast::comb_n`. Then your approach using tidyverse would be fast enough. – Manos Papadakis Nov 24 '19 at 18:01
  • Sadly both methods still seem to perform poorly, and I'd prefer to avoid Rfast for now, as I'd rather stick with base R and the tidyverse if I can. – Alex Nov 24 '19 at 18:23
  • Seems like even using Rfast::comb_n doesn't work quickly enough. I think the real bottleneck is operating on the output table from the initial combination – Alex Nov 24 '19 at 19:31