14

I would like to read a file with fread from data.table that has a column with "YYYY-MM-DD" format dates. By default, fread reads the column as chr. However, I would like to have the column as Date, the same way I would obtain when applying as.Date.

I have tried to use

dt[,starttime.date := as.Date(starttime.date)]

but it takes very long to run (I have approx. 43 million rows).

paljenczy
  • 4,779
  • 8
  • 33
  • 46
  • 1
    Have a look at `?fread`, the second paragraph under Description. – Arun Mar 19 '15 at 09:10
  • 1
    @Arun Would `as.IDate` be faster than `as.Date`? – Roland Mar 19 '15 at 09:14
  • 1
    Related: http://stackoverflow.com/q/12786335/1412059 – Roland Mar 19 '15 at 09:15
  • 1
    Thank you. `as.IDate` is not faster, takes the same time as `as.Date`. – paljenczy Mar 19 '15 at 09:31
  • 2
    I find that adding a format string in as.Date usually helps, as in `as.Date(d, format='%Y-%m-%d')`. Another thing you could try, since you have a large number of rows and presumably a smaller set of unique dates, is to group by the date column and do the conversion per group. – ytsaig Mar 19 '15 at 12:28
  • @Roland, it seems to just wrap `as.Date(...)`... :-( – Arun Mar 19 '15 at 17:42
  • @Arun: thanks for pointing that out. read.csv tried to get cute on me with my datetimes (and numeric), so I was reading a subset of data, looking up the class of the columns, and forcing colClasses to `character` for the offending columns, then doing a "real" read.csv. However, if I had any data lines that were corrupt/different than the test set I was getting an error (not warning) e.g.: `scan() expected 'an integer', got 'jR156037'`. It's a relief I can avoid colClasses in fread. I get only a warning, e.g.: `"Bumped column 3 to type character on data row 285, field contains 'jR156037'"` – mpag Mar 08 '18 at 21:06

1 Answers1

17

Using the fasttime package, as suggested in the fread documentation, is approximately 100x faster than as.Date or as.IDate:

library(data.table)
library(fasttime)

dt[,starttime.date := fastPOSIXct(starttime.date)]

Benchmark results:

library(microbenchmark)
library(fasttime)
DT <- data.table(start_date = paste(sample(1900:2018, 100000, replace = T), 
                                    sample(1:12, 100000, replace = T),
                                    sample(1:28, 100000, replace = T),
                                    sep = "-"))
microbenchmark(
  as.Date(DT$start_date),
  as.IDate(DT$start_date),
  fastPOSIXct(DT$start_date)
)

> Unit: milliseconds
>                        expr    mean 
>      as.Date(DT$start_date)  383.89
>     as.IDate(DT$start_date)  405.89
>  fastPOSIXct(DT$start_date)    4.59 
Mark Egge
  • 638
  • 8
  • 9