Summarizing the data manipulation problem I'm struggling with in a title was tough, but a short example illustrates it well. I have a data table of airline-related data, with columns for the origin and destination airports. My data looks like this:
my_data = flightdata %>%
  filter(TailNum == t_n) %>%
  select(airport1 = Origin, airport2 = Dest, ActualElapsedTime) %>%
  group_by(airport1, airport2) %>%
  summarize(ActualElapsedTime = mean(ActualElapsedTime)) %>%
  collect()
my_data
   airport1 airport2 ActualElapsedTime
      <chr>    <chr>             <dbl>
1       HNL      ITO          51.67416
2       HNL      KOA          44.00937
3       HNL      LIH          37.98526
4       HNL      OGG          37.52542
5       ITO      HNL          50.02260
6       ITO      OGG          39.22222
7       KOA      HNL          43.52518
8       KOA      OGG          34.20370
9       LIH      HNL          33.47679
10      LIH      OGG          47.06522
11      OGG      HNL          35.28550
12      OGG      ITO          37.57143
13      OGG      KOA          31.61364
14      OGG      LIH          45.46667
This table is the result of the group_by and summarize shown above. However, when I do my group_by and summarize, I want (HNL, ITO) and (ITO, HNL) to be grouped together rather than kept separate (row 1 has (HNL, ITO), row 5 has (ITO, HNL)). These rows are separate in my output because row 1 corresponds to flights from HNL airport to ITO airport, while row 5 corresponds to flights from ITO to HNL. In other words, I want to group by the unordered pair of airports, so my resulting data frame would have 7 groups, not 14.
Any thoughts on this would be greatly appreciated. I'm just learning dplyr and the %>% for piping and they're great, and I'd like to get this as well.
EDIT - I tried changing my select line to
select(airport1=pmin(Origin,Dest),airport2=pmax(Origin,Dest),ActualElapsedTime)
but it didn't work...
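To make the goal concrete, here is a minimal sketch of the grouping I'm after, on made-up data rather than my real flightdata. It assumes a local, in-memory data frame and that the unordered pair key is built in a mutate step before grouping (select only picks and renames columns, so computed expressions such as pmin(Origin, Dest) don't belong there). The `flights` and `pairs` names and the values are hypothetical, and whether the same pipeline works on the real flightdata source is not confirmed here.

```r
library(dplyr)

# Made-up flights (NOT the real data): two HNL -> ITO legs,
# one ITO -> HNL leg, and one unrelated OGG -> HNL leg.
flights <- tibble(
  Origin            = c("HNL", "ITO", "HNL", "OGG"),
  Dest              = c("ITO", "HNL", "ITO", "HNL"),
  ActualElapsedTime = c(50, 52, 54, 36)
)

pairs <- flights %>%
  # Build a direction-independent key: the alphabetically smaller
  # code goes in airport1 and the larger in airport2, element-wise.
  mutate(airport1 = pmin(Origin, Dest),
         airport2 = pmax(Origin, Dest)) %>%
  group_by(airport1, airport2) %>%
  # Average over all flights in either direction for each pair.
  summarize(ActualElapsedTime = mean(ActualElapsedTime), .groups = "drop")

pairs
```

In this toy example the three flights between HNL and ITO end up in a single (HNL, ITO) group regardless of direction, which is the behavior I want for the 14 rows above. Note that the mean is taken over the individual flights, not over the two per-direction means, so a direction with more flights carries more weight.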