1

I have a "messy" dataset that I'm struggling to convert to a tidy format. Here's what the data looks like:

make <- c("Honda", "Jeep", "Nissan", "Ford")
model <- c("Civic", "Wrangler", "Altima", "Focus")
year <- c(1996, 2000, 2005, 1988)
color <- c("red;green;blue", "red;blue", "purple;red;green;black", "yellow;white;blue")
car.df <- data.frame(make, model, year, color)

What I would like to do is to convert the data to tidy/long format by separating out the "color" field and then making each make/model/year/color a separate line. So the output would look something like this (shown just for the Honda and Jeep):

make.new <- c("Honda", "Honda", "Honda", "Jeep", "Jeep")
model.new <- c("Civic", "Civic", "Civic", "Wrangler", "Wrangler")
year.new <- c(1996, 1996, 1996, 2000, 2000)
color.new <- c("red", "green", "blue", "red", "blue")
car.df.new <- data.frame(make.new, model.new, year.new, color.new)

Any suggestions on how to do this? In the dataset, there can be many different colors so once you separate the color field into different columns, you would potentially have lots of different columns to tidy up (and therefore a different number of rows in the tidy data set at the end for each make/model/year).

Thanks for any helpful suggestions!

Steve

Steven
  • 239
  • 2
  • 10
  • `idx <- strsplit(as.character(car.df$color), ';'); car.df <- car.df[rep(1:nrow(car.df), lengths(idx)), ]; car.df$color <- unlist(idx); car.df` – rawr May 07 '16 at 15:19

1 Answers1

2

cSplit from splitstackshape can do this in a compact way. Specify the splitCols ("color"), sep (";") and the direction ("long"), and it will give the expected output.

library(splitstackshape)
cSplit(car.df, "color", ";", "long")
#      make    model year  color
# 1:  Honda    Civic 1996    red
# 2:  Honda    Civic 1996  green
# 3:  Honda    Civic 1996   blue
# 4:   Jeep Wrangler 2000    red
# 5:   Jeep Wrangler 2000   blue
# 6: Nissan   Altima 2005 purple
# 7: Nissan   Altima 2005    red
# 8: Nissan   Altima 2005  green
# 9: Nissan   Altima 2005  black
#10:   Ford    Focus 1988 yellow
#11:   Ford    Focus 1988  white
#12:   Ford    Focus 1988   blue

If we need a dplyr/tidyr solution

library(dplyr)
library(tidyr)
library(stringr)
separate(car.df, color, into = paste0("color", seq(max(str_count(color, 
           ";"))+1)), fill="right") %>% 
     gather(Var, color, - make, -model, -year) %>%
     select(-Var)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Awesome thank you! Just what I needed. I will keep that cSplit function in mind going forward. – Steven May 07 '16 at 14:11