0

I have a dataset with success or not of some events for some users. An example with 2 users and 3 distinct events:

data.frame(
   id      = c('A', 'A', 'A', 'B', 'B', 'B'),
   event   = c('score', 'pass', 'dribble', 'score', 'pass', 'dribble'),
   success = c(1, 1, 1, 0, 1, 1)
)
# id  event success
# A   score       1
# A    pass       1
# A dribble       1
# B   score       0
# B    pass       1
# B dribble       1

I would like to mesure relation between events, how many times 2 successful events are present for a user. When event1 is achieved, event2 is it often achieved too? Event 1 and event 2 are they correlated?

In this example with 2 users, both achieved events 2 and 3, but only one achieved event 1. The expected output is:

data.frame(
   event1  = c('score', 'score', 'pass'),
   event2  = c('pass', 'dribble', 'dribble'),
   corr    = c(0.5, 0.5, 1) 
)
# event1   event2   corr
#  score     pass     .5
#  score  dribble     .5
#   pass  dribble      1

Such a table will help me to build a network, to weight and highlight links between the distinct events. Thank you in advance.

I can imagine a solution with a for loop, but I guess there is something more elegant. :)

demarsylvain
  • 2,103
  • 2
  • 14
  • 33

2 Answers2

0

Here is a way using some dplyr functions.

library(dplyr)
df <- data.frame(
   id      = c('A', 'A', 'A', 'B', 'B', 'B'),
   event   = c('score', 'pass', 'dribble', 'score', 'pass', 'dribble'),
   success = c(1, 1, 1, 0, 1, 1)
)

expand.grid(event1 = df$event, event2 = df$event) %>%  # generate all possible event combinations
  filter(event1 != event2) %>% # remove event1 = event2
  filter(!duplicated(data.frame(t(apply(.,1,sort))))) %>% # remove duplicates
  inner_join(df, by =c("event1" = "event")) %>% #adding the successes
  inner_join(df, by = c("event2" = "event"), suffix = c("_1", "_2")) %>%
  group_by(across(starts_with("event"))) %>%
  summarise(across(starts_with("success"), sum), .groups = "keep") %>%  # sum successes
  summarise(corr = success_2 / success_1, .groups = "drop") # calculate your corr

# A tibble: 3 x 3
  event1  event2  corr
  <chr>   <chr>  <dbl>
1 dribble pass     1  
2 dribble score    0.5
3 pass    score    0.5

The second filter line is adapted from this other Stack Overflow question: How to remove duplicates based on the combinations of two columns

Ben Norris
  • 5,639
  • 2
  • 6
  • 15
0

First of all, the correlations should be corr = c (0,0,1), because for score pass and score dribble event 1 involves event 2 half the time and the other half the time it doesn't.

Second, I'm going to duplicate your data (with a little change) because you can't calculate a correlation with just two rows (3 pairs of events with 2 rows doesn't works):

df <- rbind(df,mutate(df,success=1-success))

Under the assumption that the data is ordered by id and event occurrence, you can try this code:

library(dplyr)
library(purrr)
   
event_diff <- c(1L,2L)

my_fun <- function(ed,df){
      df %>% mutate(id2=lead(id,n=ed),
                    event2=lead(event,n=ed),
                    success2=lead(success,n=ed)) %>%
            filter(id==id2 ) %>%
            select(-id,-id2)}

events <- map_dfr(event_diff,my_fun,df=df)

map_dfr(split(events,paste(events$event,events$event2)),
        function(x)data.frame(event1=x$event[1],event2=x$event2[1],corr=cor(x$success,x$success2)))
Marcos Pérez
  • 1,260
  • 2
  • 7