
Suppose I have an R data.table (data.frame) with a column of timestamp strings, e.g.:

"2018-10-12 12:35:37.004226", 
"2018-10-12 12:35:38.00332", 
"2018-10-12 12:35:39.000999",...

What is the best data type on which to carry out a rolling join? Should I convert this column to Unix epoch time (type numeric), or keep it as is (type character)?

My primary objective is accurate and identical output across different systems. My initial thought was that since numerics are not represented exactly, the conversion might yield a different ordering of the timestamps (and thus different join results) on different systems.

I've now come to understand that while R cannot represent every decimal fraction exactly, each value is mapped to the same IEEE 754 double-precision bit pattern on every system, so the representation is deterministic.

That is,

sprintf("%.60f",as.numeric(as.POSIXct("2018-10-12 12:35:37.004226",tz="GMT")))

produces this representation

[1] "1539347737.004225969314575195312500000000000000000000000000000000000000"

across four different systems to which I have access (taken from Double precision (64-bit) representation of numeric value in R (sign, exponent, significand)).

I think it is thus safe to say that rolling joins should match across systems, since the mapping to doubles is deterministic and cannot alter the strict ordering of the timestamps. (Note that the 60-digit printout above is the exact decimal expansion of a binary fraction, not 60 digits of precision: a double carries roughly 15-17 significant decimal digits, which at epoch magnitude is sub-microsecond resolution. As long as distinct timestamps differ by more than that, they map to distinct doubles and ordering is preserved.)

So for rolling joins on time stamps, convert to numeric, carry out your join, and don't look back.
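To make the question concrete, here is a minimal sketch of that approach with `data.table`. The `quotes`/`trades` tables and their values are invented for illustration; the timestamps are the ones from the question, parsed with `%OS` so the fractional seconds survive:

```r
library(data.table)

# Character timestamps converted once to numeric seconds since the epoch
quotes <- data.table(
  ts = as.numeric(as.POSIXct(
    c("2018-10-12 12:35:37.004226",
      "2018-10-12 12:35:38.00332",
      "2018-10-12 12:35:39.000999"),
    format = "%Y-%m-%d %H:%M:%OS", tz = "GMT")),
  quote = c(100.1, 100.2, 100.3),
  key = "ts")

# A lookup time between the second and third quote
trades <- data.table(
  ts = as.numeric(as.POSIXct("2018-10-12 12:35:38.5",
                             format = "%Y-%m-%d %H:%M:%OS", tz = "GMT")),
  key = "ts")

# roll = TRUE rolls the last observation forward (LOCF):
# each trade gets the most recent quote at or before its timestamp
quotes[trades, roll = TRUE]  # matches the 12:35:38.00332 row, quote 100.2
```

Since the character-to-double conversion is deterministic, this join should return the same row on every system.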

Is there agreement on this?

  • If you have many distinct values, numeric will be much smaller than character. (Repeating character values are stored only once, with a pointer to an entry in the R "string cache"). Try `object.size(paste("2018-10-12 12:35:39", 0:1e4)); object.size(0:1e4)` Also since your character strings omit trailing zeros... are you sure rolling joins will behave as they should? I guess order should be retained with numeric despite messiness from the floating-point representation. – Frank Oct 15 '18 at 20:02
  • _accurate and identical output across different systems_ - for that you have to use integers instead of floating point. Making rolling joins on that will be tricky, but possible. Depending on precision you might need to split field into two integer fields (IDate, ITime) and perform 2+ rolling joins. You might need to shift values to detect rolling matches of `2000-01-01 23:59:59.999` to `2000-01-02 00:00:00.000`. It will carry overhead. The other scalable alternative is to code that in C/C++ using int64. – jangorecki Apr 22 '19 at 16:09
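Following the integer suggestion above, one sketch that avoids a two-field IDate/ITime split is to store integer microseconds since the epoch as `bit64::integer64`, which `data.table` accepts as a join key. The tables and the `to_us` helper are made up for illustration; the `round()` relies on the fact that a double at epoch magnitude still resolves well below a microsecond:

```r
library(data.table)
library(bit64)  # provides the integer64 type; assumed installed

# Parse a character timestamp and return exact integer microseconds
# since the epoch. round() strips the double's sub-microsecond noise.
to_us <- function(x) {
  t <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%OS", tz = "GMT")
  as.integer64(round(as.numeric(t) * 1e6))
}

quotes <- data.table(ts = to_us(c("2018-10-12 12:35:37.004226",
                                  "2018-10-12 12:35:38.00332")),
                     quote = c(100.1, 100.2), key = "ts")
trades <- data.table(ts = to_us("2018-10-12 12:35:37.5"), key = "ts")

quotes[trades, roll = TRUE]  # prevailing quote: 100.1
```

Because the key is an exact integer count, ordering and equality are unambiguous on every system, at the cost of one conversion step and the `bit64` dependency.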

0 Answers