1

I am working with a dataset df1 in R. It has two columns: Date and Student_ID. I first want to find out the number of pairs of students who have appeared together in a certain date. Suppose these pairs of students are in a column V. Now I want to create a dataframe with columns V such that corresponding to every Date, we have a 0 or 1 under every column if the pair of students appeared on the same date. I am having trouble writing this code as it is quite a complex data manipulation to come up with.

This is how the dataframe looks like:

> head(pair_combinations, 20)
# A tibble: 20 × 2
    DATE STUDENT_ID
      <int> <chr>      
 1 21500001 203145     
 2 21500001 200794     
 3 21500001 201143     
 4 21500001 2594       
 5 21500001 201952     
 6 21500001 200757     
 7 21500001 203934     
 8 21500001 203471     
 9 21500001 203488     
10 21500001 203118     
11 21500019 203145     
12 21500019 200794     
13 21500019 201143     
14 21500019 2594       
15 21500019 201952     
16 21500019 203471     
17 21500019 203934     
18 21500019 203118     
19 21500019 203501     
20 21500026 203145

I want the output like:

    DATE INTERACTION_203145_200794 INTERACTION_203145_201143 ...
      <int> <chr>      
 1 21500001          1                          1 
 2 21500019          1                          1 ...
...
thedumbkid
  • 125
  • 5
  • 3
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Aug 29 '23 at 15:02
  • @MrFlick, Thanks, I have included it now – thedumbkid Aug 29 '23 at 15:10
  • 1
    I think the goal is still obscure.. The example data is just the head but the expected output does not depend on that subset. You are committing the cardinal sin of MCVE's by expecting us to know what is in the three dots that are the column names. You should edit the example data to be smaller and to have some duplicated student ids. and then display all of the expected output. – IRTFM Aug 29 '23 at 15:28

1 Answers1

1

It's not pretty, but I'm guessing this might work

First, a helper function to get all pairs of values

all_pairs<-function(x) {
  if (length(x) >= 2) {
    sapply(combn(sort(unique(x)),2, simplify=FALSE), paste, collapse="_")
  } else {
    NA
  }
}

We sort the values so the smaller ID always comes first so we don't repeat pairs (it will always be A_B and not B_A).

Now we can do some reshaping

library(dplyr)
library(tidyr)

dd %>% 
  group_by(DATE) %>% 
  reframe(intx = all_pairs(STUDENT_ID)) %>% 
  mutate(val = 1) %>% 
  pivot_wider(names_from = intx, values_from = val, values_fill = 0)

Tested with

dd <- read.table(text="
    DATE STUDENT_ID
 1 21500001 203145     
 2 21500001 200794     
 3 21500001 201143     
 4 21500001 2594       
 5 21500001 201952     
 6 21500001 200757     
 7 21500001 203934     
 8 21500001 203471     
 9 21500001 203488     
10 21500001 203118     
11 21500019 203145     
12 21500019 200794     
13 21500019 201143     
14 21500019 2594       
15 21500019 201952     
16 21500019 203471     
17 21500019 203934     
18 21500019 203118     
19 21500019 203501     
20 21500026 203145", header=TRUE)
MrFlick
  • 195,160
  • 17
  • 277
  • 295