I am reading a CSV file using the data.table package.

Sample dataset:

structure(list(Size = c(4886L, 4096L, 84848L, 518L, 264158L, 
725963L, 1340L, 75264L, 198724L), ModifiedTime = c("Jun 11, 2009 06:51:08 PM", 
"Aug 21, 2008 03:54:28 PM", "Feb 12, 2007 12:40:00 PM", "Aug 22, 2006 02:12:03 PM", 
"Dec 08, 2009 03:28:14 PM", "Sep 29, 2008 03:45:21 PM", "Sep 07, 2011 03:36:54 AM", 
"Jul 28, 2011 05:09:58 PM", "Jul 23, 2012 02:25:58 PM"), AccessTime = c("Mar 15, 2013 09:24:53 AM", 
"May 12, 2009 04:45:41 PM", "Apr 07, 2014 09:39:03 AM", "Dec 25, 2007 06:48:18 AM", 
"Apr 08, 2013 11:52:15 AM", "May 17, 2011 08:48:40 AM", "Mar 12, 2013 02:55:01 AM", 
"Jun 07, 2014 04:21:28 PM", "Jan 21, 2013 12:58:07 PM"), contentid = c("000000285b7925f511b3159a72f80a4a", 
"0000011afae4d1227c4df57b410ea52c", "000001cec02017ca3eb81ddc4cd1c9ff", 
"00000233565d1c17c3135a9504c455ca", "000003020ba74b9d1b6075d3c1b8fcb3", 
"0000034b98d29d84ce7b61ee68be7658", "000004ed899e26ae1c9b1ece35a98af1", 
"000005a09fd2eb706c5800eb06084160", "0000060b9d552c35f281b5033dcfa1b4"
)), row.names = c(NA, -9L), class = c("data.table", "data.frame"
))

While loading, I can specify the type of each column being read, for example:

tble = fread("sample.csv", colClasses = c(Size="numeric", contentid="character"))

My question is:

  1. Is it possible to specify how to parse the date column while loading? I know I can convert the column afterwards with as.POSIXct(tble$AccessTime, format = "%b %d, %Y %I:%M:%S %p"), but can I specify this format while loading, so that the column is read as a datetime column instead of character?
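For reference, a minimal sketch of the post-load conversion mentioned above, done by reference with `:=` so the column is not copied. The two-row table is a stand-in for the result of `fread()`, and `tz = "UTC"` is an assumption, since the question does not say which time zone the timestamps use. Note that `%p` (AM/PM) only takes effect together with the 12-hour `%I`, not with `%H`.

```r
library(data.table)

# Stand-in for the table returned by fread(); values copied from the sample above.
tble <- data.table(
  Size = c(4886L, 4096L),
  AccessTime = c("Mar 15, 2013 09:24:53 AM", "May 12, 2009 04:45:41 PM")
)

# %I is the 12-hour clock and pairs with %p (AM/PM).
# %b and %p are locale-dependent; an English locale is assumed here.
fmt <- "%b %d, %Y %I:%M:%S %p"

# tz = "UTC" is an assumption; use the zone the timestamps were recorded in.
tble[, AccessTime := as.POSIXct(AccessTime, format = fmt, tz = "UTC")]
```

The column is character while `fread()` runs and becomes POSIXct only in this second step.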

Edit: The reason I want to specify the parsing at load time is that I assume it would help the CSV load faster. (I am not sure this assumption is right.)

PS: Note that I have to use data.table because my CSV file is very large (~5 GB).

Ritchie Sacramento
monte
  • See https://stackoverflow.com/questions/13022299/specify-custom-date-format-for-colclasses-argument-in-read-table-read-csv. (btw, I don't imagine your assumption would be correct). – Ritchie Sacramento Dec 07 '20 at 12:38
  • Does anything in [this answer](https://stackoverflow.com/questions/13022299/specify-custom-date-format-for-colclasses-argument-in-read-table-read-csv) solve your problem? – Yerpo Dec 07 '20 at 12:41
  • And for faster string to date conversion consider the `anytime` package or see here https://stackoverflow.com/questions/14218297/convert-string-date-to-r-date-fast-for-all-dates – holzben Dec 07 '20 at 12:44
  • I agree with @27ϕ9 ... I do not expect any delimited-file reader to operate *faster* or even *as fast* when doing in-place conversion of strings to `Date` or `POSIXt`; at best it will be slightly slower. I *would* expect this in-place conversion to take about the same time as "read file + `as.Date`", where the only advantage of doing it during read-in is for code golf. – r2evans Dec 07 '20 at 12:57
  • @r2evans I feel like parsers could be faster. First there's the memory allocation (i.e., no need to allocate a character vector and then allocate a numeric vector), and then there's the fact that `strptime()` seems slow, which is why there are packages like `fasttime`. With that said, writing a parser that handles the various formats would take a lot of time. Relatedly, `data.table` does now read some character strings as IDate, but this POSIXct format would not count. To the OP, I think the answer is no. – Cole Dec 09 '20 at 02:16
  • @r2evans what you're missing is the R global string cache. In fact, when we added support for reading standard ISO8601 dates and times, we observed _huge_ increases in parsing performance -- we skip the step of adding each date's string to the global string cache and then removing it (which will likely trigger a `gc()` as well). Moreover, because of the global string cache, it's not possible to parallelize over the table at the C level (because the R API (`SET_STRING_ELT` and `STRING_ELT`) is required & it's not thread-safe) – MichaelChirico Dec 09 '20 at 02:33
  • @monte we don't have plans to support anything but ISO8601 time formats in `fread` itself, unfortunately. I do recommend having a look at [this](https://github.com/Rdatatable/data.table/issues/2603) issue (with [this](https://github.com/Rdatatable/data.table/pull/3279) in-progress PR to implement it). Basically, you'll do much better by only converting unique timestamps, which you can then join back to the full table. – MichaelChirico Dec 09 '20 at 02:37
  • @Cole note that `fasttime` only works for ISO8601 time, which is why it's fast. `strptime` is slow because reading timestamps, as a completely general problem, is hard! Daylight savings, time zones, weekday abbreviations, ... no fun. I think it's parallelization that let us do better than `fasttime` (plus some minor optimizations on the reading itself), see the PR for benchmarks: https://github.com/Rdatatable/data.table/pull/4464 – MichaelChirico Dec 09 '20 at 02:42
  • @MichaelChirico yes, you found my weakness. As many times as I've read and referenced Adv-R, I've missed the global string hash/pool/cache (...). Interesting, that can explain so many things. And it means I need to retract my previous comment. Thanks to Cole and MichaelChirico for the thorough explanations. – r2evans Dec 09 '20 at 14:58
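The suggestion above to convert only the unique timestamps and join them back could look roughly like the following. This is a sketch under assumptions, not the implementation from the linked issue or PR: the table is a made-up stand-in with repeated values, the new column name `AccessTimeParsed` is hypothetical, and `tz = "UTC"` is assumed. It only pays off when the column contains many repeated strings.

```r
library(data.table)

# Hypothetical stand-in for a large table with many repeated timestamps.
tble <- data.table(
  AccessTime = rep(c("Mar 15, 2013 09:24:53 AM", "May 12, 2009 04:45:41 PM"), times = 5)
)

# %b and %p are locale-dependent; an English locale is assumed here.
fmt <- "%b %d, %Y %I:%M:%S %p"

# Parse each distinct string once...
lookup <- unique(tble[, .(AccessTime)])
lookup[, AccessTimeParsed := as.POSIXct(AccessTime, format = fmt, tz = "UTC")]

# ...then join the parsed values back onto the full table by reference.
tble[lookup, on = "AccessTime", AccessTimeParsed := i.AccessTimeParsed]
```

Here `as.POSIXct()` runs on 2 strings instead of 10; on a multi-gigabyte file the saving depends entirely on how often timestamps repeat.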
