3

I want to summarize a data frame according to values in two columns, except each column actually has the same variable in it and I don't care what the order is. Here is a sample data frame:

> df<-data.frame(val1=c(1,1,1,2,2), val2=c(1,2,2,1,1), val3=c(10, 20, 30, 40, 50))
> df
  val1 val2 val3
1    1    1   10
2    1    2   20
3    1    2   30
4    2    1   40
5    2    1   50

I am interested in summarizing over val3 except I only care about the unordered combination of val1 and val2. So for instance I want to treat

  val1 val2 val3
1    1    2   20

the same as

  val1 val2 val3
1    2    1   40

I can use dplyr to group by columns and summarize like this:

> df %>% group_by(val1, val2) %>% summarize(mean=mean(val3))
Source: local data frame [3 x 3]
Groups: val1 [?]

   val1  val2  mean
  <dbl> <dbl> <dbl>
1     1     1    10
2     1     2    25
3     2     1    45

But I want this result to only contain two rows!

I've thought about creating a new variable that reflects the combination of val1 and val2 and then group by that, but can't figure out an easy way to this. val1 and val2 don't have to be numeric and likely won't always be consecutive integers.

oregano
  • 816
  • 9
  • 25
  • Also related: http://stackoverflow.com/q/15495795/ and http://stackoverflow.com/q/28047997/ – Frank Feb 09 '17 at 19:25

1 Answers1

8

We can do this with pmin and pmax to create the grouping variables

df %>%
    group_by(val_1 = pmin(val1, val2), val_2 = pmax(val1, val2)) %>% 
    summarise(val3 = mean(val3))
#   val_1 val_2  val3
#  <dbl> <dbl> <dbl>
#1     1     1    10
#2     1     2    35

As @Gregor mentioned in the comments, the use of changed names was intentional (val_1 vs. val1 or val_2 in place of val2) as the second statement pmax will evaluate the output from the 'val1' column if we had val1= pmin(val1, val2)

akrun
  • 874,273
  • 37
  • 540
  • 662