
I have a dataset, a sample of which can be represented by the below:

    ORIGIN <- c("SFO", "BOS", "LAX", "BOS")
    DEST <- c("BOS", "SFO", "BOS", "LAX")
    PASS <- c(100, 50, 20, 40)
    df <- data.frame(ORIGIN, DEST, PASS)

ORIGIN DEST    PASS
SFO    BOS     100
BOS    SFO     50
LAX    BOS     20
BOS    LAX     40

I'm trying to combine ORIGIN and DEST together to form ROUTE, and then aggregate the rows by route, without regard to the origin or destination, to get total passenger numbers.

Ideally I would like the below:

ROUTE    PASS
SFO-BOS  150
LAX-BOS  60

I don't care which direction is used to name the route (SFO-BOS or BOS-SFO), as long as both directions are combined into one row.

As a first step, I've used this:

    df$ROUTE <- paste(df$ORIGIN, df$DEST, sep = "-")

To get the below:

ORIGIN DEST   ROUTE    PASS
SFO    BOS    SFO-BOS  100
BOS    SFO    BOS-SFO  50
LAX    BOS    LAX-BOS  20
BOS    LAX    BOS-LAX  40

However, while I would usually use something like ddply to get the passenger sum, I'm unsure how to proceed here, since "BOS-SFO" and "SFO-BOS" need to be treated as the same route.

This would not be an issue in a small file, where I could fix the names by hand, but the file I'm working with is very large.

Would someone be able to help?

  • Relevant - http://stackoverflow.com/questions/35834385/create-unique-identifier-from-the-interchangeable-combination-of-two-variables/35834584 or http://stackoverflow.com/questions/25297812/pair-wise-duplicate-removal-from-dataframe/25298863 or http://stackoverflow.com/questions/25145982/extract-unique-rows-from-a-data-table-with-each-row-unsorted/25151395 – thelatemail Feb 07 '17 at 22:46
  • Your `paste()` can still work, but you need to sort the route first, `paste(apply(df[,1:2],1,max),apply(df[,1:2],1,min),sep="-") ` Then you can simply aggregate by route `aggregate(PASS ~ ROUTE, data=df, sum)` – gsun Feb 07 '17 at 22:53

2 Answers


This is what pmax and pmin are for. Using dplyr:

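library(dplyr)

df %>%
  # as.character() guards against factor columns, for which pmin/pmax are not meaningful
  mutate(start = pmin(as.character(ORIGIN), as.character(DEST)),
         end = pmax(as.character(ORIGIN), as.character(DEST))) %>%
  group_by(start, end) %>%
  summarize(PASS = sum(PASS))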

#  start   end  PASS
# 1   BOS   LAX    60
# 2   BOS   SFO   150
GGamba
  • Yep. Base translation - `df$ROUTE <- interaction(do.call(pmin, lapply(df[c("ORIGIN","DEST")],as.character)), do.call(pmax, lapply(df[c("ORIGIN","DEST")],as.character)))` and `aggregate(PASS ~ ROUTE, data=df, sum)` – thelatemail Feb 07 '17 at 22:44
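
The base R one-liner in the comment above can be spelled out step by step. This is only a minimal sketch, assuming ORIGIN and DEST are character vectors; it uses `paste()` rather than `interaction()` for the key, and the name `totals` is introduced here purely for illustration.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps the airport codes as character.
df <- data.frame(
  ORIGIN = c("SFO", "BOS", "LAX", "BOS"),
  DEST   = c("BOS", "SFO", "BOS", "LAX"),
  PASS   = c(100, 50, 20, 40),
  stringsAsFactors = FALSE
)

# pmin/pmax work element-wise, so the alphabetically smaller code always
# comes first and both directions of a route produce the same key.
df$ROUTE <- paste(pmin(df$ORIGIN, df$DEST), pmax(df$ORIGIN, df$DEST), sep = "-")

# Sum passengers per direction-independent route.
totals <- aggregate(PASS ~ ROUTE, data = df, FUN = sum)
print(totals)
```

Because `pmin()` and `pmax()` are vectorised, this avoids a per-row function call, which may matter on a large file.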

Using mapply in base R to build a direction-independent route key, then aggregating on it:

# sort each pair of dest and origin by row and paste results.
df$ROUTE <- mapply(function(x, y) paste(sort(c(x, y)), collapse="-"),
                   df$ORIGIN, df$DEST, USE.NAMES=FALSE)

aggregate(PASS ~ ROUTE, data=df, sum)
    ROUTE PASS
1 BOS-LAX   60
2 BOS-SFO  150

data

Note that this method requires working with character variables rather than factor variables. The data.frame, df, is constructed as below.

df = data.frame(ORIGIN, DEST, PASS, stringsAsFactors=FALSE)
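
If the data frame already exists with factor columns (before R 4.0.0, `data.frame()` converted strings to factors by default), the columns can be converted in place before building the route key. The following is a sketch under that assumption; `is_fac` and `result` are names made up for this example.

```r
# Hypothetical starting point: the airport codes were read in as factors.
df <- data.frame(
  ORIGIN = factor(c("SFO", "BOS", "LAX", "BOS")),
  DEST   = factor(c("BOS", "SFO", "BOS", "LAX")),
  PASS   = c(100, 50, 20, 40)
)

# Convert only the factor columns to character, leaving PASS numeric.
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

# Same route key as above: sort each origin/destination pair, then paste.
df$ROUTE <- mapply(function(x, y) paste(sort(c(x, y)), collapse = "-"),
                   df$ORIGIN, df$DEST, USE.NAMES = FALSE)

result <- aggregate(PASS ~ ROUTE, data = df, FUN = sum)
print(result)
```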
lmo