1

I have a column of dates, I want to extract the year, month and day into separate columns. Unfortunately there are inconsistent entries in the dates column, so the normal solution of using format(as.Date(),"%Y") or lubridate::year() doesn't work.

Here is an example dataframe:

dates_df <- data.frame(dates = c("1985-03-23", "", "1983", "1984-01"))

And here is the desired result:

       dates year month  day
1 1985-03-23 1985     3   23
2            <NA>  <NA> <NA>
3       1983 1983  <NA> <NA>
4    1984-01 1984     1 <NA>

I can achieve the desired result with the following code, but it is very slow on large datasets (>100,000 rows):

dates_df$year <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[1])
dates_df$month <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[2])
dates_df$day <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[3])

My question:

Is there a more efficient (fast) way to extract year, month, day columns from messy date data?

flee
  • 1,253
  • 3
  • 17
  • 34

2 Answers2

2

Using strsplit and adapting the lengths.

cbind(dates_df, t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)))
#        dates    1    2    3
# 1 1985-03-23 1985   03   23
# 2            <NA> <NA> <NA>
# 3       1983 1983 <NA> <NA>
# 4    1984-01 1984   01 <NA>

With nice names:

cbind(dates_df, `colnames<-`(
  t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)), c('year', 'month', 'day')))
#        dates year month  day
# 1 1985-03-23 1985    03   23
# 2            <NA>  <NA> <NA>
# 3       1983 1983  <NA> <NA>
# 4    1984-01 1984    01 <NA>
jay.sf
  • 60,139
  • 8
  • 53
  • 110
2

My first thought would have been to try tidyr::separate. Untested for speed and might break down if there are date formats not represented in the example data.

tidyr::separate(dates_df, 
                dates, 
                into = c('year', 'month', 'day'), 
                remove = FALSE)

#-----
       dates year month  day
1 1985-03-23 1985    03   23
2                  <NA> <NA>
3       1983 1983  <NA> <NA>
4    1984-01 1984    01 <NA>
nniloc
  • 4,128
  • 2
  • 11
  • 22