7

I have a dataset of 25 variables and over 2 million observations. One of my variables is a combination of a few different "categories" that I want to split to where it shows 1 category per column (similar to what split would do in stata). For example:

# Name      Age     Number               Events                      First 
# Karen      24        8         Triathlon/IM,Marathon,10k,5k         0
# Kurt       39        2         Half-Marathon,10k                    0 
# Leah       18        0                                              1

And I want it to look like:

# Name   Age  Number Events_1        Event_2      Events_3     Events_4      First
# Karen   24    8     Triathlon/IM    Marathon       10k         5k             0
# Kurt    39    2     Half-Marathon   10k            NA          NA             0 
# Leah    18    0     NA              NA             NA          NA             1

I have looked through stackoverflow but have not found anything that works (everything gives me an error of some sort). Any suggestions would be greatly appreciated.

Note: May not be important but the largest number of categories 1 person has is 19 therefore I would need to create Event_1:Event_19

Comment: Previous stack overflows have suggested the separate function, however this function does not seem to work with my dataset. When I input the function the program runs but when it is finished nothing is changed, there is no output, and no error code. When I tried to use other suggestions made in other threads I received error messages. However, I finally got it is work by using the cSplit function. Thank for the help!!!

Kfruge
  • 73
  • 1
  • 1
  • 4
  • possible duplicate of [Assigning results of strsplit to multiple columns of data frame](http://stackoverflow.com/questions/31357963/assigning-results-of-strsplit-to-multiple-columns-of-data-frame) – Pierre L Jul 23 '15 at 03:06
  • another possible duplicate: http://stackoverflow.com/q/18641951/2372064 – MrFlick Jul 23 '15 at 03:07
  • If you use `separate` from `tidyr` (answer in the duplicate question), just add `extra='drop'` to pad out the non-existing ones with NA. – mathematical.coffee Jul 23 '15 at 03:09
  • 2
    `cSplit` from my "splitstackshape" package should do what you're asking for. – A5C1D2H2I1M1N2O1R2T1 Jul 23 '15 at 03:18
  • For some reason the separate package does not work, an error does not show up but when the program stops running the dataset is unchanged and there is no output. – Kfruge Jul 23 '15 at 06:56
  • @AnandaMahto, cSplit is exactly what I needed. Thanks! – Kfruge Jul 23 '15 at 07:01

1 Answers1

22

From Ananda's splitstackshape package:

cSplit(df, "Events", sep=",")
#    Name Age Number First      Events_1 Events_2 Events_3 Events_4
#1: Karen  24      8     0  Triathlon/IM Marathon      10k       5k
#2:  Kurt  39      2     0 Half-Marathon      10k       NA       NA
#3: Leah   18      0     1            NA       NA       NA       NA

Or with tidyr:

separate(df, 'Events', paste("Events", 1:4, sep="_"), sep=",", extra="drop")
#   Name Age Number               Events_1 Events_2 Events_3 Events_4 First
#1 Karen  24      8           Triathlon/IM Marathon      10k       5k     0
#2  Kurt  39      2          Half-Marathon      10k     <NA>     <NA>     0
#3 Leah   18      0                     NA     <NA>     <NA>     <NA>     1

With the data.table package:

setDT(df)[,paste0("Events_", 1:4) := tstrsplit(Events, ",")][,-"Events", with=F]
#    Name Age Number First               Events_1 Events_2 Events_3 Events_4
#1: Karen  24      8     0           Triathlon/IM Marathon      10k       5k
#2:  Kurt  39      2     0          Half-Marathon      10k       NA       NA
#3: Leah   18      0     1                     NA       NA       NA       NA

Data

df <- structure(list(Name = structure(1:3, .Label = c("Karen", "Kurt", 
"Leah "), class = "factor"), Age = c(24L, 39L, 18L), Number = c(8L, 
2L, 0L), Events = structure(c(3L, 2L, 1L), .Label = c("               NA", 
"         Half-Marathon,10k", "     Triathlon/IM,Marathon,10k,5k"
), class = "factor"), First = c(0L, 0L, 1L)), .Names = c("Name", 
"Age", "Number", "Events", "First"), class = "data.frame", row.names = c(NA, 
-3L))
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • 1
    Or `library(data.table); setDT(df)[, paste0('Events', 1:4):= tstrsplit(Events, ',')]` – akrun Jul 23 '15 at 04:19