Summarizing this data manipulation problem in a title was tough, but a short example illustrates it well. I have a table of airline data with columns for the origin and destination airports. My data looks like this:

my_data = collect(filter(flightdata, TailNum == t_n) %>%
                select(airport1=Origin,airport2=Dest,ActualElapsedTime) %>% 
                group_by(airport1,airport2) %>%
                summarize(ActualElapsedTime = mean(ActualElapsedTime)))

my_data
     airport1   airport2     ActualElapsedTime
      <chr>    <chr>             <dbl>
1       HNL      ITO          51.67416
2       HNL      KOA          44.00937
3       HNL      LIH          37.98526
4       HNL      OGG          37.52542
5       ITO      HNL          50.02260
6       ITO      OGG          39.22222
7       KOA      HNL          43.52518
8       KOA      OGG          34.20370
9       LIH      HNL          33.47679
10      LIH      OGG          47.06522
11      OGG      HNL          35.28550
12      OGG      ITO          37.57143
13      OGG      KOA          31.61364
14      OGG      LIH          45.46667

This table is the result of the group_by and summarize shown above. However, when I group and summarize, I want (HNL, ITO) and (ITO, HNL) to fall into the same group rather than separate ones (row 1 has (HNL, ITO), row 5 has (ITO, HNL)). They are separate in my output because row 1 corresponds to flights from HNL to ITO, and row 5 corresponds to flights from ITO to HNL. The resulting data frame would then have 7 groups, not 14.

Any thoughts on this would be greatly appreciated. I'm just learning dplyr and the %>% pipe, and they're great; I'd like to get this working as well.

EDIT - I tried changing my select line to

select(airport1=pmin(Origin,Dest),airport2=pmax(Origin,Dest),ActualElapsedTime)

but it didn't work...

Canovice
  • This looks like a similar question: http://stackoverflow.com/questions/28992028/grouping-over-all-possible-combinations-of-several-variables-with-dplyr – conrad-mac Jan 29 '17 at 02:10
  • will take a look thanks – Canovice Jan 29 '17 at 02:12
  • I am not grouping over all combinations of three variables like in that post. I am trying to tell dplyr "treat HNL ITO the same as ITO HNL, even though they are in different columns". I think the best way to do so is to set airport1 and airport2 correctly, like how i tried in my select statement – Canovice Jan 29 '17 at 02:20

1 Answer

One way to do it is to create a new variable that identifies each pairing of airport1 and airport2, then summarize the data using it as the grouping variable.

In this case, we create a variable called combination by pasting the names of airport1 and airport2 in alphabetical order, so both directions of a route get the same value.

my_data <- collect(
  flightdata %>% 
    filter(TailNum == t_n) %>%
    select(airport1 = Origin, airport2 = Dest, ActualElapsedTime) %>% 
    # Build a direction-independent key: the alphabetically smaller airport comes first
    mutate(combination = ifelse(airport1 < airport2, 
                                paste(airport1, airport2),
                                paste(airport2, airport1))
    ) %>% 
    # Group on the key so "HNL ITO" covers both HNL -> ITO and ITO -> HNL
    group_by(combination) %>% 
    summarize(ActualElapsedTime = mean(ActualElapsedTime))
  )
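
As for why the pmin/pmax attempt in the question's edit failed: select() only picks and renames existing columns, so it cannot evaluate an expression such as pmin(Origin, Dest). Computed columns belong in mutate(). A minimal sketch of that variant is below. It assumes a small in-memory data frame (the toy `flights` table and its values are made up for illustration) rather than the original flightdata source; on a remote backend, whether pmin and pmax are translated depends on that backend.

```r
library(dplyr)

# Hypothetical toy data standing in for the filtered flight table
flights <- data.frame(
  Origin = c("HNL", "ITO", "HNL", "KOA", "OGG", "HNL"),
  Dest   = c("ITO", "HNL", "KOA", "HNL", "HNL", "OGG"),
  ActualElapsedTime = c(52, 50, 44, 43, 35, 38),
  stringsAsFactors = FALSE
)

pairs <- flights %>%
  # pmin/pmax compare the two columns element-wise, so airport1 is always
  # the alphabetically smaller code and airport2 the larger one
  mutate(airport1 = pmin(Origin, Dest),
         airport2 = pmax(Origin, Dest)) %>%
  group_by(airport1, airport2) %>%
  summarize(ActualElapsedTime = mean(ActualElapsedTime)) %>%
  ungroup()

print(pairs)
```

Compared with the pasted combination key, this keeps the two airports in separate columns, which may be handier if they are needed individually later on.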
Juan Bosco