2

I have a column variable in my data that is a time range: datesemployed. Sample data is below:

name datesemployed        university   
Kate Oct 2015 – Jan 2016  Princeton
Sue  Nov 2015 – Dec 2017  Pomona

I want to separate rows such that each year in the time range has its own row:

name datesemployed  university   
Kate  2015          Princeton
Kate  2016          Princeton
Sue   2015          Pomona
Sue   2016          Pomona
Sue   2017          Pomona

In a separate post, the solution was to separate rows based on -. But that solution only works if the time range is always one year. If the data were 2015-2017 for example, I would be missing a row for 2016.

The separate_rows syntax from the previous answer makes a lot of sense to me, but I'm not sure how to adapt it to this situation, or if it's even possible to do so. I've also found that with large frames of data (~1000 observations), separate_rows is really slow, and freezes R for me.

Yu Na
  • 112
  • 1
  • 18

1 Answers1

3

We could extract the 'year' part with str_extract_all, then do a sequence by looping over the list column and unnest the output

library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
    mutate(datesemployed = map(str_extract_all(datesemployed, 
          "\\d+"),  ~as.numeric(.x[1]):as.numeric(.x[2]))) %>% 
    unnest(c(datesemployed))
# A tibble: 5 x 3
#  name  datesemployed university
#  <chr>         <int> <chr>     
#1 Kate           2015 Princeton 
#2 Kate           2016 Princeton 
#3 Sue            2015 Pomona    
#4 Sue            2016 Pomona    
#5 Sue            2017 Pomona   

Or another option is to use separate_rows on the 'datesemployed', then with complete after grouping by 'name', 'university', we can expand the data

df1 %>%
    separate_rows(datesemployed, sep= "\\s*–\\s*") %>% 
    mutate(datesemployed = readr::parse_number(datesemployed)) %>% 
    group_by(name, university)  %>%
    complete(datesemployed =min(datesemployed):max(datesemployed))

Or in base R

lst1 <- lapply(gsub(".*\\s+(\\d+)\\D+(\\d+)", "\\1:\\2", 
      df1$datesemployed), function(x) eval(parse(text = x)))
transform(df1[rep(seq_len(nrow(df1)), lengths(lst1)), c("name", "university")],
         datesemployed = unlist(lst1))

data

df1 <- structure(list(name = c("Kate", "Sue"), datesemployed = 
    c("Oct 2015 – Jan 2016", 
"Nov 2015 – Dec 2017"), university = c("Princeton", "Pomona"
)), class = "data.frame", row.names = c(NA, -2L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    This was a really great answer. I used the ```separate_rows``` option. I couldn't get ```complete``` to work, as my computer kept freezing, but for others with the same issue, I found that modifying the second option with ```data.table``` worked fast. ```df2<-setDT(df1)[, .(years = min(years):max(years)), by = .(name, university)]``` – Yu Na May 12 '20 at 03:22