"Dummy" coding a factor that has two values in R

Question

I'm not sure there is a better way to phrase what I'm asking. I have route data (for example LAX-BWI, SFO-JFK, etc.). I want to dummy-code it so that I have a 1 for every airport that a flight touches (direction doesn't matter, so LAX-BWI is the same as BWI-LAX).

So for example:

     ROUTE | OFF |  ON |  
    LAX-BWI|10:00|17:00|  
    LAX-SFO|11:00|13:00|  
    BWI-LAX|18:00|01:00|   
    BWI-SFO|15:00|20:00|

becomes

    BWI|LAX|SFO| OFF |  ON |  
     1 | 1 | 0 |10:00|17:00|  
     0 | 1 | 1 |11:00|13:00|  
     1 | 1 | 0 |18:00|01:00|  
     1 | 0 | 1 |15:00|20:00|

I can either pull in the data as a string "BWI-LAX" or have two columns Orig and Dest whose values are string "BWI" and "LAX".

The closest thing I can think of is dummying it, but if there is an actual term for what I want, please let me know. I feel like this has been answered, but I can't think of how to search for it.

Among other options, `library(tidyverse); df %>% separate_rows(ROUTE) %>% mutate(n = 1) %>% spread(ROUTE, n, fill = 0)` — alistaire, Nov 20 '17 at 18:46

score 1 · Answer 1 · answered Nov 20 '17 at 18:36

Someone just asked a very similar question so I'll copy my answer from here:

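    # Collect every airport code that appears in any route
    # (as.character() guards against ROUTE being a factor)
    allDest <- sort(unique(unlist(strsplit(as.character(dataFrame$ROUTE), "-", fixed = TRUE))))
    for (i in allDest) {
      # TRUE if the route string contains this airport code
      dataFrame[, i] <- grepl(i, dataFrame$ROUTE, fixed = TRUE)
    }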

This will create one new column for every airport in the set and indicate with TRUE or FALSE if a flight touches an airport. If you want 0 and 1 instead you can do:

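    allDest <- sort(unique(unlist(strsplit(as.character(dataFrame$ROUTE), "-", fixed = TRUE))))
    for (i in allDest) {
      # Multiplying the logical vector by 1 converts TRUE/FALSE to 1/0
      dataFrame[, i] <- grepl(i, dataFrame$ROUTE, fixed = TRUE) * 1
    }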

`TRUE * 1` is 1 and `FALSE * 1` is 0.
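For anyone who wants to run this end to end, here is a minimal sketch using the four sample rows from the question. The sample data frame and the final column reordering (into `result`, a name I made up) are assumptions for illustration; `fixed = TRUE` only tells `strsplit` and `grepl` to treat the airport codes as literal text rather than regular expressions.

```r
# Sample data copied from the question; `dataFrame` follows the naming used above
dataFrame <- data.frame(
  ROUTE = c("LAX-BWI", "LAX-SFO", "BWI-LAX", "BWI-SFO"),
  OFF   = c("10:00", "11:00", "18:00", "15:00"),
  ON    = c("17:00", "13:00", "01:00", "20:00"),
  stringsAsFactors = FALSE
)

# All distinct airport codes, sorted alphabetically
allDest <- sort(unique(unlist(strsplit(dataFrame$ROUTE, "-", fixed = TRUE))))

# One 0/1 indicator column per airport
for (i in allDest) {
  dataFrame[, i] <- grepl(i, dataFrame$ROUTE, fixed = TRUE) * 1
}

# Hypothetical final step: reorder columns to match the layout asked for in the question
result <- dataFrame[, c(allDest, "OFF", "ON")]
print(result)
```

This should give the BWI, LAX and SFO indicator columns followed by OFF and ON, with ROUTE left out.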

JBGruber


If it's very similar, you should [flag it as a duplicate](https://stackoverflow.com/help/duplicates) – alistaire Nov 20 '17 at 18:39
`allDest <- sort(unique(unlist(strsplit(as.character(dataFrame$ROUTE), "-"))))` Had to add `as.character` to make it work. It works, but the vector became too large for the full dataset. I'll keep playing with it to see if I can tweak something. Thank you! – versusChou Nov 20 '17 at 18:57

score 0 · Accepted Answer · answered Nov 20 '17 at 18:49

No need for the for loop. data.frames are just lists so we can assign extra elements all in one go:

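    # Distinct airport codes in order of first appearance
    # (as.character() guards against ROUTE being a factor)
    cities <- unique(unlist(strsplit(as.character(df$ROUTE), "-", fixed = TRUE)))
    # Build all indicator columns at once and assign them as new list elements
    df[, cities] <- lapply(cities, function(x) as.numeric(grepl(x, df$ROUTE, fixed = TRUE)))

    #    ROUTE   OFF    ON LAX BWI SFO
    #1 LAX-BWI 10:00 17:00   1   1   0
    #2 LAX-SFO 11:00 13:00   1   0   1
    #3 BWI-LAX 18:00 01:00   1   1   0
    #4 BWI-SFO 15:00 20:00   0   1   1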

The ROUTE column is easy enough to drop after the calculation if you don't want it.
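As a rough sketch of that last step, and of the two-column layout the question mentions, the snippet below drops ROUTE by assigning `NULL` and then builds the same indicators from separate `Orig` and `Dest` columns. The sample data, the second data frame `df2`, and the name `airports` are made up for illustration; comparing with `==` is just one way to do it when the codes are already split.

```r
# Sample data copied from the question
df <- data.frame(
  ROUTE = c("LAX-BWI", "LAX-SFO", "BWI-LAX", "BWI-SFO"),
  OFF   = c("10:00", "11:00", "18:00", "15:00"),
  ON    = c("17:00", "13:00", "01:00", "20:00"),
  stringsAsFactors = FALSE
)

cities <- unique(unlist(strsplit(df$ROUTE, "-", fixed = TRUE)))
df[, cities] <- lapply(cities, function(x) as.numeric(grepl(x, df$ROUTE, fixed = TRUE)))

# Drop ROUTE once the indicator columns exist
df$ROUTE <- NULL

# Hypothetical variant: the same routes read in as separate Orig and Dest columns
df2 <- data.frame(
  Orig = c("LAX", "LAX", "BWI", "BWI"),
  Dest = c("BWI", "SFO", "LAX", "SFO"),
  stringsAsFactors = FALSE
)

airports <- unique(c(df2$Orig, df2$Dest))
# 1 if the airport is either end of the route, since direction doesn't matter
df2[, airports] <- lapply(airports, function(x) as.numeric(df2$Orig == x | df2$Dest == x))
```

With `Orig` and `Dest` already split there is no string matching at all, so partial matches can't happen.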