9

I have a large data set with many columns containing dates in two different formats:

"1996-01-04" "1996-01-05" "1996-01-08" "1996-01-09" "1996-01-10" "1996-01-11"

and

"02/01/1996" "03/01/1996" "04/01/1996" "05/01/1996" "08/01/1996" "09/01/1996"

In both cases, the class() is "character". Since the data set has many rows (4.5 million), I am looking for an efficient data.table conversion method. Right now, I use this self-built function:

convert_to_date <- function(in_array){
  tmp <- try(as.Date(in_array, format = "%d/%m/%Y"),TRUE)
  if (all(!is.na(tmp)) & class(tmp) != "try-error"){
    return(tmp)
  } else{
    tmp2 <- try(as.Date(in_array),TRUE)
    if (all(!is.na(tmp2)) & class(tmp2) != "try-error"){
      return(tmp2)
    } else{
      return(in_array)
    }
  }
}

With which I then convert the columns (of data.table DF) that I need by

DF[,date:=convert_to_date(date)]

This is, however, still incredibly slow (nearly 45s per column).

Is there any way in optimising this via data.table methods? So far I have not found a better way, so I would be thankful for any tips.

P.S: For better readability, I have 'outsourced' the function to a second file and sourced it in my main routine. Does that have a (negative) significant impact on computation speed in R?

Uwe
  • 41,420
  • 11
  • 90
  • 134
Daedalus
  • 235
  • 1
  • 2
  • 9
  • 2
    `data.table` has `as.IDate()` – talat Sep 06 '17 at 14:45
  • Interesting question, but you have to post reproducible example for this one. – pogibas Sep 06 '17 at 14:46
  • 2
    [This answer](https://stackoverflow.com/a/46050286/3817004) contains a benchmark of various methods & packages to convert character dates to class `Date`. – Uwe Sep 06 '17 at 15:06
  • @docendodiscimus Thanks, but in my personal experience as.Date() and as.IDate() work comparably fast for data.table objects, I haven't run a benchmark test yet, but it did not 'feel' faster, so I was trying to hack my code somewhere else. – Daedalus Sep 06 '17 at 16:51
  • @PoGibas Sorry what am I missing here? I thought given the data, the function and the command I am using, my problem is fully (and quickly) reproducible? I will try to be more explicit next time if there was something missing. – Daedalus Sep 06 '17 at 16:51
  • 1
    Fyi, the speedup from IDate is not in the conversion; it's in everything that comes after that (filtering, joining). And generally using integers for this sort of thing will be faster than float. – Frank Sep 06 '17 at 17:29

4 Answers4

17

According to this benchmark, the fastest method to convert character dates in standard unambiguous format (YYYY-MM-DD) into class Date is to use as.Date(fasttime::fastPOSIXct()).

Unfortunately, this requires to test the format beforehand because your other format DD/MM/YYYY is misinterpreted by fasttime::fastPOSIXct().

So, if you don't want to bother about the format of each date column you may use the anytime::anydate() function:

# sample data
df <- data.frame(
    X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"), 
    X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"), 
    stringsAsFactors = FALSE)

library(data.table)
# convert date columns
date_cols <- c("X1", "X2")
setDT(df)[, (date_cols) := lapply(.SD, anytime::anydate), .SDcols = date_cols]
df
           X1         X2
1: 1996-01-04 1996-02-01
2: 1996-01-05 1996-03-01
3: 1996-01-08 1996-04-01
4: 1996-01-09 1996-05-01
5: 1996-01-10 1996-08-01
6: 1996-01-11 1996-09-01

The benchmark timings show that there is a trade off between the convenience offered by the anytime package and performance. So if speed is crucial, there is no other way to test the format of each column and to use the fastest conversion method available for the format.

The OP has used the try() function for this purpose. The solution below uses regular expressions to find all columns which match a given format (only row 1 is used to save time). This has the additional benefit that the names of the relevant columns are determined automatically and need not to be typed in.

# enhanced sample data with additional columns
df <- data.frame(
    X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"), 
    X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"), 
    X3 = "other data",
    X4 = 1:6,
    stringsAsFactors = FALSE)

library(data.table)
options(datatable.print.class = TRUE)

# coerce to data.table
setDT(df)[]
# convert date columns in standard unambiguous format YYYY-MM-DD
date_cols1 <- na.omit(names(df)[
  df[1, sapply(.SD, stringr::str_detect, pattern = "\\d{4}-\\d{2}-\\d{2}"),]])
# use fasttime package
df[, (date_cols1) := lapply(.SD, function(x) as.Date(fasttime::fastPOSIXct(x))), 
   .SDcols = date_cols1]
# convert date columns in DD/MM/YYYY format
date_cols2 <- na.omit(names(df)[
  df[1, sapply(.SD, stringr::str_detect, pattern = "\\d{2}/\\d{2}/\\d{4}"),]])
# use lubridate package
df[, (date_cols2) := lapply(.SD, lubridate::dmy), .SDcols = date_cols2]
df
           X1         X2         X3    X4
       <Date>     <Date>     <char> <int>
1: 1996-01-04 1996-01-02 other data     1
2: 1996-01-05 1996-01-03 other data     2
3: 1996-01-08 1996-01-04 other data     3
4: 1996-01-09 1996-01-05 other data     4
5: 1996-01-10 1996-01-08 other data     5
6: 1996-01-11 1996-01-09 other data     6

Caveat

In case one of the date columns does contain NA in the first row, this column may escape unconverted. To handle these cases, the above code needs to be amended.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Thanks Uwe! This is exactly what I was looking for! I did not know about the str_detect/pattern combo, that looks way better than my 'try' approach. Fasttime was also new to me and looks very intriguing. So far I stayed away from Posix as I believed it always converts into a number whereas I wanted to keep the date strings for easier debugging, but I will see how your code handles it. I will implement it later tonight, but your code clearly does what I am after! Thanks a lot – Daedalus Sep 06 '17 at 16:43
  • Just one thing: `df[, (date_cols2) := lapply(.SD, lubridate::mdy), .SDcols = date_cols2]` produces the following error for me: **Warning message: All formats failed to parse. No formats found.** So since I am not familiar with the package I have switched back to `df[, (date_cols2) := lapply(.SD, as.Date,format = "%d/%m/%Y"), .SDcols = date_cols2]` But I would be interested if there is a solution in lubridate if you know what's wrong. The date format is DD/MM/YYYY. – Daedalus Sep 06 '17 at 19:30
  • 1
    My bad, I have picked the wrong date format `MM/DD/YYYY` instead of `DD/MM/YYYY` and accordingly the wrong `lubridate` function. Please, try `lubridate::dmy()`. (Note that the functions are named after the formats they support, e.g., `dmy(), mdy(), ymd()`, etc.) I've updated my answer accordingly. – Uwe Sep 06 '17 at 22:49
  • 3
    unfortunately `fasttime::fastPOSIXct("1969-05-05")` returns `NA` because it won't parse years before 1970 so it's another reason to use `anytime::anydate()`. – topchef Apr 11 '19 at 04:28
3

Your data

df <- data.frame(X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"), X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"), stringsAsFactors=F)

'data.frame':   6 obs. of  2 variables:
 $ X1: chr  "1996-01-04" "1996-01-05" "1996-01-08" "1996-01-09" ...
 $ X2: chr  "02/01/1996" "03/01/1996" "04/01/1996" "05/01/1996" ...

solution

library(dplyr)
library(lubridate)
ans <- df %>%
         mutate(X1 = ymd(X1), X2 = mdy(X2))

          X1         X2
1 1996-01-04 1996-02-01
2 1996-01-05 1996-03-01
3 1996-01-08 1996-04-01
4 1996-01-09 1996-05-01
5 1996-01-10 1996-08-01
6 1996-01-11 1996-09-01

str(ans)

'data.frame':   6 obs. of  2 variables:
 $ X1: Date, format: "1996-01-04" "1996-01-05" ...
 $ X2: Date, format: "1996-02-01" "1996-03-01" ...
CPak
  • 13,260
  • 3
  • 30
  • 48
  • 1
    Thanks CPak, this looks quite efficient too, but given the large scale computations I am doing I want to stick with data.table as much as possible. Very much appreciate your work nevertheless though! – Daedalus Sep 06 '17 at 16:44
1

Since you know beforehand there are only two date formats, this is easy. The format argument to as.Date is vectorized:

as_date_either <- function(x) {
    format_vec <- rep_len("%Y-%m-%d", length(x))
    format_vec[grep("/", x, fixed = TRUE)] <- "%m/%d/%Y"
    as.Date(x, format = format_vec)
}

Edited: replaced ifelse with subset assignment, which is faster

Nathan Werth
  • 5,093
  • 18
  • 25
0

If there are any duplicated date fields in your dataset, then one way you could do is by setting up de-duplicated reference table then do the mapping on the smaller dataset. This will be faster than converting the date fields on all records.

Data

df <- data.frame(
  X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", rep("1996-01-11", 100)), 
  X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", rep("09/01/1996", 100)), 
  stringsAsFactors = FALSE)

Create unique Date rows for mapping

date_mapping <- function(date_col){

  ref_df <- data.frame(date1 = unique(date_col), stringsAsFactors = FALSE)

  if(all(grepl("/", ref_df$date1))) {
    ref_df$date2 <- as.Date(ref_df$date1, format = "%d/%m/%Y")

  } else {
    ref_df$date2 <- as.Date(ref_df$date1)  
  }

  date_col_mapped <- ref_df[match(date_col, ref_df$date1), "date2"]

  return(date_col_mapped)

}


date_mapping(df$X1)
date_mapping(df$X2)
MKa
  • 2,248
  • 16
  • 22