0

I'm working with job application data where each prior job held is a row in an excel file. I want to transform the data set such that there are columns for each past employer 1,2,3,4 etc...

I think the problem is best explained with an example. How do I get from the start data frame to the desired data frame?

I've tried some melting and casting but I'm getting stuck because I don't want a column created for each unique company name, but rather based on the number of unique company names.

id <- c(1000,1000,1002,1007,1007,1007,1007,1009) 
employers <-c("Ikea","Subway","DISH","DISH","Ikea","Starbucks","Google","Google")
start_date <- c("2/1/2013","5/1/2000","4/1/2012","3/1/2014","8/15/2011","4/15/2008","2/1/2004","3/15/2010")
start <- data.frame(cbind(id,employers,start_date))
colnames(start) <- c("id","employers","start_date")

start

unique_id <- c(1000,1002,1007,1009)
emp1 <- c("Ikea","DISH","DISH","Google")
emp2 <- c("Subway",NA,"Ikea",NA)
emp3 <- c(NA,NA,"Starbucks",NA)
emp4 <- c(NA, NA,"Google",NA)
emp1_start <- c("2/1/2013","4/1/2012","3/1/2014","3/15/2010")
emp2_start <- c("5/1/2000",NA,"8/15/2011",NA)
emp3_start <- c(NA,NA,"4/15/2008",NA)
emp4_start <- c(NA,NA,"2/1/2004",NA)
desired <- data.frame(cbind(unique_id,emp1,emp2,emp3,emp4,emp1_start,emp2_start,emp3_start,emp4_start))

desired
andrea
  • 117
  • 10
  • `start$time <- with(start, ave(as.character(id),id,FUN=seq_along) ); reshape(start, direction="wide", idvar="id", sep="")` from the other answer. – thelatemail Aug 17 '16 at 05:14
  • You forgot to rename the columns :-) (just kidding ... your codegolf beats mine handily). – r2evans Aug 17 '16 at 05:24
  • Thanks @thelatemail for spotting the duplicate and also posting an answer using my example. Creating the timevar as expected by reshape worked perfectly on my actual data with is much bigger and messier. – andrea Aug 17 '16 at 15:35

1 Answers1

0

Using your data (intentionally with factors, easy enough to fix with stringsAsFactors = FALSE):

start <- data.frame(
          id=c(     "1000",      "1000",      "1002",      "1007",
                    "1007",      "1007",      "1007",      "1009" ),
   employers=c(     "Ikea",    "Subway",      "DISH",      "DISH",
                    "Ikea", "Starbucks",    "Google",    "Google" ),
  start_date=c( "2/1/2013",  "5/1/2000",  "4/1/2012",  "3/1/2014",
               "8/15/2011", "4/15/2008",  "2/1/2004", "3/15/2010" )
)

Will this work for you?

library(dplyr)
library(tidyr)

a <- start %>%
  select(-start_date) %>%
  group_by(id) %>%
  mutate(emp = sprintf("emp%s", seq_len(n()))) %>%
  ungroup() %>%
  spread(emp, employers)

b <- start %>%
  select(-employers) %>%
  group_by(id) %>%
  mutate(emp = sprintf("emp%s_start", seq_len(n()))) %>%
  ungroup() %>%
  spread(emp, start_date)

left_join(a, b, by = "id")
# # A tibble: 4 x 9
#       id   emp1   emp2      emp3   emp4 emp1_start emp2_start emp3_start emp4_start
#   <fctr> <fctr> <fctr>    <fctr> <fctr>     <fctr>     <fctr>     <fctr>     <fctr>
# 1   1000   Ikea Subway        NA     NA   2/1/2013   5/1/2000         NA         NA
# 2   1002   DISH     NA        NA     NA   4/1/2012         NA         NA         NA
# 3   1007   DISH   Ikea Starbucks Google   3/1/2014  8/15/2011  4/15/2008   2/1/2004
# 4   1009 Google     NA        NA     NA  3/15/2010         NA         NA         NA
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks @r2evens. I'm going to hang onto this for the future. It worked great on my simple example but was a bit cumbersome on the actual data which also has multiple rows for past schools and associated dates, GPAs etc so the select() portion wasn't straightforward. – andrea Aug 17 '16 at 15:38