I have two treatment groups in my data set and I am looking for a fast method for calculating the pairwise differences between observations in the first group and second group.

How can I quickly create all the combinations of observations and take their difference?

I think I might be able to get combinations of the subject ids by using expand.grid like so...

expand.grid(df$subjectID[df$treatment == 'Active'],
            df$subjectID[df$treatment == 'Placebo'])

and then I could join the outcome values based on subject ID and take their difference. I'd like a more generalized approach though if it is available.

I'm basically trying to calculate the Mann-Whitney U statistic from scratch so I need to determine if an outcome value in the active treatment group is greater than the outcome value in the placebo group (Y_a - Y_p > 0). In other words, I need to compare every response in the active treatment group to every response in the placebo treatment group.

So if I have some data that looks like this...

Subject Treatment   Outcome
1       Active      5
2       Active      7
3       Active      6
4       Placebo     2
5       Placebo     1

I want to calculate the difference matrix...

    S4  S5
S1  5-2 5-1
S2  7-2 7-1
S3  6-2 6-1

Here's some real data:

df <- structure(list(subjectID = c(342L, 833L, 347L, 137L, 111L, 1477L
), treatment = c("CC + TV", "CC + TV", "CC + TV", "Control", 
"Control", "Control"), score_ch = c(2L, 3L, 2L, 3L, 0L, 0L)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

I got the results that I wanted via:

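library(dplyr)

# All (treated, control) pairs of subject IDs
diff_df <- expand.grid(T_ID = df$subjectID[df$treatment == 'CC + TV'],
                       C_ID = df$subjectID[df$treatment == 'Control'])

# Attach each subject's score (score_ch.x = treated, score_ch.y = control),
# then score each pair: 1 if treated is higher, 0.5 for a tie, 0 otherwise
pair_scores <- diff_df %>%
  left_join(df %>% select(subjectID, score_ch), by = c('T_ID' = 'subjectID')) %>%
  left_join(df %>% select(subjectID, score_ch), by = c('C_ID' = 'subjectID')) %>%
  mutate(val = case_when(score_ch.x == score_ch.y ~ 0.5,
                         score_ch.x > score_ch.y ~ 1,
                         score_ch.x < score_ch.y ~ 0))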

But that kind of... sucks.

Emma Jean
  • Hi Emma, I think that you are trying to do calculations by group (i.e. dplyr's `group_by` or data.table's `by = `), but it's hard to tell without a sample of your data. Can you provide some with `dput`? – Ian Campbell Apr 01 '20 at 19:34
  • @IanCampbell Hi Ian, I've added some more detail. Hopefully that helps. – Emma Jean Apr 01 '20 at 19:40

1 Answer


How about with base R outer?

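treated <- df$treatment == "CC + TV"
control <- df$treatment == "Control"
# Use plain vectors (df$score_ch) so this also works when df is a tibble.
# Rows are treated subjects, columns are controls; each cell is treated - control.
Result <- outer(df$score_ch[treated], df$score_ch[control], FUN = '-')
dimnames(Result) <- list(df$subjectID[treated], df$subjectID[control])
Result
#    137 111 1477
#342  -1   2    2
#833   0   3    3
#347  -1   2    2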
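Since the end goal in the question is the Mann-Whitney U statistic, here is a minimal sketch of how a difference matrix like this could be turned into U. It rebuilds the sample data as a plain data.frame so it runs on its own; the names `treated`, `control`, `diffs`, `scores`, `U` and `W` are just illustrative, and the scoring (1 for a win, 0.5 for a tie, 0 for a loss) follows the `case_when` in the question. The comparison with `wilcox.test` is only a sanity check.

```r
# Sample data from the question, rebuilt as a plain data.frame
df <- data.frame(
  subjectID = c(342L, 833L, 347L, 137L, 111L, 1477L),
  treatment = c("CC + TV", "CC + TV", "CC + TV", "Control", "Control", "Control"),
  score_ch  = c(2L, 3L, 2L, 3L, 0L, 0L)
)

treated <- df$treatment == "CC + TV"
control <- df$treatment == "Control"

# Rows = treated subjects, columns = controls, cell = treated - control
diffs <- outer(df$score_ch[treated], df$score_ch[control], FUN = "-")
dimnames(diffs) <- list(df$subjectID[treated], df$subjectID[control])

# sign() returns -1 / 0 / 1, so (sign + 1) / 2 maps to 0 / 0.5 / 1
scores <- (sign(diffs) + 1) / 2
U <- sum(scores)
U
# 6.5

# Sanity check: wilcox.test reports the same quantity as its W statistic
# (it warns about ties when computing the p-value, hence suppressWarnings)
W <- unname(suppressWarnings(
  wilcox.test(df$score_ch[treated], df$score_ch[control])$statistic
))
all.equal(U, W)
```

Keeping the full `scores` matrix around (rather than only the sum) also makes it easy to see which specific pairs contributed a win, a tie, or a loss.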
Ian Campbell
  • Oh wow, that almost seems too easy. I tried it and it worked! I'd like to retain the IDs but I am guessing if I do not reorder the vectors I should be able to reattach them correctly. – Emma Jean Apr 01 '20 at 20:14
  • Indeed, some of the old base R functions come in handy. I updated my answer to include `subjectID`. – Ian Campbell Apr 01 '20 at 20:19