4

I know this is a long-standing, deeply embedded issue, but it's something I come up against so regularly, and that I see beginners to R struggle with so regularly, that I'd love to have a satisfactory solution. My Google and SO searches have come up empty so far, but please point me in the right direction if this is duplicated elsewhere.

TL;DR: Is there a way to use something like the POSIXct class without a timezone? I generally use tz="UTC" regardless of the actual timezone of the dataset, but it's a messy hack IMO, and I don't particularly like it. What I want is something like tz=NULL, which would behave the same way as UTC, but without actually adding "UTC" as a tzone attribute.


The problem

I'll start with an example (there are plenty) of typical timezone issues. Creating an object with POSIXct values:

df <- data.frame( timestamp = as.POSIXct( c( "2018-01-01 03:00:00",
                                             "2018-01-01 12:00:00" ) ),
                  a = 1:2 )
df

#             timestamp a
# 1 2018-01-01 03:00:00 1
# 2 2018-01-01 12:00:00 2

That's all fine, but then I try to convert the timestamps to dates:

df$date <- as.Date( df$timestamp )
df

#             timestamp a       date
# 1 2018-01-01 03:00:00 1 2017-12-31
# 2 2018-01-01 12:00:00 2 2018-01-01

The dates have converted incorrectly because my computer's time zone is Australian Eastern Time: the strings were parsed as local time, so the underlying numeric values of the timestamps are shifted by the offset relevant to my locale (in this case -11 hours). We can see this by forcing the timezone to UTC, then comparing the values before and after:

df$timestamp[1]
# [1] "2018-01-01 03:00:00 AEDT"

x <- lubridate::force_tz( df$timestamp[1], "UTC" ); x
# [1] "2018-01-01 03:00:00 UTC"

difftime( df$timestamp[1], x )
# Time difference of -11 hours

That's just one example of the issues caused by timezones. There are others, but I won't go into them here.
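To reproduce the shift on any machine, here is a minimal sketch that names the zones explicitly instead of relying on the session time zone. It assumes the "Australia/Sydney" zone is available in the local time zone database; the variable names are only illustrative.

```r
# Same wall-clock string, parsed in two explicitly named zones, so the result
# does not depend on the time zone of the machine running the code.
local_ts <- as.POSIXct("2018-01-01 03:00:00", tz = "Australia/Sydney")
utc_ts   <- as.POSIXct("2018-01-01 03:00:00", tz = "UTC")

# POSIXct stores seconds since the epoch; the two values differ by 11 hours.
offset_hours <- (as.numeric(local_ts) - as.numeric(utc_ts)) / 3600
offset_hours
# [1] -11

# The resulting date depends on the zone used for the conversion.
date_in_utc    <- as.Date(local_ts, tz = "UTC")
date_in_sydney <- as.Date(local_ts, tz = "Australia/Sydney")
format(date_in_utc)     # "2017-12-31"
format(date_in_sydney)  # "2018-01-01"
```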


My hack-y solution

I don't want that behaviour, so I need to convince as.POSIXct not to mess with my timestamps. I generally do this by using tz="UTC", which works fine, except that I'm adding information to the data that isn't real. These times are NOT in UTC, I'm just saying that to avoid time-shift issues. It's a hack, and any time I give my data to someone else, they could be forgiven for thinking that the timestamps are in UTC when they're not. To avoid this, I generally add the actual timezone to the object/column name, and hope that anyone I pass my data on to will understand why someone would label an object with a timezone different to the one in the object itself:

df <- data.frame( timestamp.AET = as.POSIXct( c( "2018-01-01 03:00:00",
                                                 "2018-01-01 12:00:00" ),
                                              tz = "UTC" ),
                  a = 1:2 )
df$date <- as.Date( df$timestamp.AET )
df

#         timestamp.AET a       date
# 1 2018-01-01 03:00:00 1 2018-01-01
# 2 2018-01-01 12:00:00 2 2018-01-01

What I'm hoping for

What I really want is a way to use POSIXct without having to specify a timezone. I don't want the times messed with in any way. Do everything as though the values were in UTC, and leave any timezone details like offsets, daylight savings, etc to the user. Just don't pretend they actually ARE in UTC. Here's my ideal:

x <- as.POSIXct( "2018-01-01 03:00:00" ); x
# [1] "2018-01-01 03:00:00"

attr( x, "tzone" )
# [1] NULL

shifted <- lubridate::force_tz( x, "UTC" )
shifted == x
# [1] TRUE

as.numeric( shifted ) == as.numeric( x )
# [1] TRUE

as.Date( x )
# [1] "2018-01-01"

So there's no timezone attribute on the object at all. The date conversion works as one would expect from the printed value. If there are daylight savings time-shifts, or any other locale-specific issues, the user (me or someone else) needs to deal with that themselves.

I believe something similar to this is possible in POSIXlt, but I really don't want to shift to that. chron or another timeseries-oriented package might be another solution, but I think POSIXct is more widely used and accepted, and this seems like something that should be possible within base::. A POSIXct object with tz="UTC" is exactly what I need, I just don't want to have to lie about timezones in order to get it to behave the way I want (and I believe most beginners to R expect).

So what do others do here? Is there an easy way to use POSIXct without a timezone that I've missed? Is there a better work-around than tz="UTC"? Is that what others are doing?

rosscova
  • I don't have a resolution for you, nor would I use it if/when somebody is able to do so. So many problems I've run into in database work are based on incorrect time/date management (parsing, insertion, etc). And removing the TZ from the datatype was critical. That's not to say that I don't have problems, but at least I have a lot more information to isolate *where* the problem occurs (based on bad TZ, you'd be surprised what can be correlated/troubleshot!). So though I feel your pain/frustration, I can only recommend against removal of TZ from the data type. Strongly. From lessons learned. – r2evans Jul 05 '18 at 23:54
  • 1
    @r2evans fair enough, but for now at least I disagree. I think of time-zones like units. A temperature can be stored as `C`, `F`, `K`... and we don't ask R to shift/adjust the values in any way. The user is trusted to record/know what the units are and how they need to be treated, and (most importantly) R behaves logically with the numeric values. If I combine datasets with temperatures in different units, I need to deal with that, and R stays out of the way regardless of my locale. Could timestamps be the same? – rosscova Jul 06 '18 at 00:09
  • 2
    I hear what you are saying, but I also have disparate locations for both users and data acquisition. When a query returns 4pm, is that relative to where the data was acquired, where it is stored, or to the user making the query? I understand that *"the user is trusted to record/know what the units are"*, but history has shown (to me) that the user really can never be trusted (rarely due to malice). Temperatures have units, they are \*F and \*C. Timestamps have units, they are UTC, Asia/Tokyo (or JST), and such. Good luck with your argument! I don't have to agree to appreciate it! – r2evans Jul 06 '18 at 00:25
  • Thanks @r2evans, I certainly see your points. FWIW, I always have 2 timestamps recorded for data acquisition (in addition to any user input), a `timestamp_client` (set by the user's device) and a `timestamp_server` (set by the server on submission) basically for the reasons you mention. – rosscova Jul 06 '18 at 00:40
  • There is no such thing as a datetime without a time zone. Thus, the POSIX standard includes a time zone in datetime objects. Using UTC has always been good enough for me. You can always turn your data into UTC by adding or subtracting the appropriate offset if you want to be super-correct but usually I don't care and just specify UTC to avoid DST issues with non-DST data. – Roland Jul 06 '18 at 07:54
  • Hi @Roland. Yeah, that's pretty much my point. There is no such thing, but it's very often useful, so I think it should be possible NOT to specify one. The fact that most people do what you and I do (specify UTC even when it's not true to avoid things like DST issues) shows that use-case pretty clearly I think. – rosscova Jul 07 '18 at 01:41
  • I don't think you understand. There is not only no implementation but even the concept of datetime without a timezone doesn't exist. – Roland Jul 07 '18 at 11:27
  • @Roland I do understand. Likewise, a numeric value doesn't represent an actual temperature/distance/volume/... without a unit, but we use them all the time, we don't expect a programming language to forcefully manipulate those numeric values because of that, and would be rightfully frustrated if those conversions were hidden from printed results and were locale-dependent. You and I both routinely "lie" about the timezone of our timestamps to say to R "nothing to see here, leave these alone please". I think there should be a way to say that without the "UTC" lie. `tz=NULL` would work fine. – rosscova Jul 07 '18 at 22:33

2 Answers

5

I'm not sure I understand your issue. Having (re-)read your post and ensuing comments, I see your point.

To summarise:

as.POSIXct takes its default tz from your system. as.Date has default tz = "UTC" for class POSIXct. So unless your system is in tz = "UTC", dates may change; the solution is to pass tz explicitly to as.Date, or to change the behaviour of as.Date.POSIXct (see update below).

Case 1

If you don't specify an explicit tz with as.POSIXct, you can simply specify tz = "" with as.Date to enforce a system-specific timezone.

df <- data.frame(
    timestamp = as.POSIXct(c("2018-01-01 03:00:00", "2018-01-01 12:00:00")),
    a = 1:2)

df$date <- as.Date(df$timestamp, tz = "")
df
#             timestamp a       date
# 1 2018-01-01 03:00:00 1 2018-01-01
# 2 2018-01-01 12:00:00 2 2018-01-01

Case 2

If you do set an explicit tz with as.POSIXct, you can extract tz from the POSIXct object and pass it on to as.Date:

df <- data.frame(
    timestamp = as.POSIXct(c("2018-01-01 03:00:00", "2018-01-01 12:00:00"), tz = "UTC"),
    a = 1:2)

tz <- attr(df$timestamp, "tzone")
tz
#[1] "UTC"

df$date <- as.Date(df$timestamp, tz = tz)
df
#             timestamp a       date
# 1 2018-01-01 03:00:00 1 2018-01-01
# 2 2018-01-01 12:00:00 2 2018-01-01

Update

There exists a related discussion on Dirk Eddelbuettel's anytime GitHub project site. The discussion turns out somewhat circular, so I'm afraid it does not offer too much in terms of understanding why as.Date.POSIXct does not inherit tz from POSIXct. I would probably call this a base R idiosyncrasy (or as Dirk calls it: "[T]hese are known quirks in Base R").

As for a solution: I would change the behaviour of as.Date.POSIXct rather than the default behaviour of as.POSIXct.

We could simply redefine as.Date.POSIXct to inherit tz from the POSIXct object.

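as.Date.POSIXct <- function(x, tz = attr(x, "tzone"), ...) {
    # Use the object's own time zone; fall back to the session zone if unset
    if (is.null(tz)) tz <- ""
    as.Date(as.POSIXlt(x, tz = tz[1L]))
}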

Then you get consistent results for your sample case:

df <- data.frame(
    timestamp = as.POSIXct(c("2018-01-01 03:00:00", "2018-01-01 12:00:00")),
    a = 1:2)
df$date <- as.Date(df$timestamp)
df
#             timestamp a       date
# 1 2018-01-01 03:00:00 1 2018-01-01
# 2 2018-01-01 12:00:00 2 2018-01-01
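
If overriding a base method for the whole session feels too invasive, a small wrapper gives the same effect only where you call it. This is just a sketch: as_date_own_tz is a hypothetical helper name, not part of base R or any package, and the example assumes the "Australia/Sydney" zone is available.

```r
# Hypothetical helper: convert to Date using the object's own tzone attribute
# instead of as.Date's "UTC" default for POSIXct.
as_date_own_tz <- function(x) {
  tz <- attr(x, "tzone")
  if (is.null(tz)) tz <- ""   # no attribute set: fall back to the session zone
  as.Date(x, tz = tz[1L])
}

x <- as.POSIXct("2018-01-01 03:00:00", tz = "Australia/Sydney")

own_zone_date <- as_date_own_tz(x)
utc_date      <- as.Date(x, tz = "UTC")
format(own_zone_date)  # "2018-01-01"
format(utc_date)       # "2017-12-31"
```

The trade-off is that every conversion has to go through the helper, whereas the redefinition above changes what a plain as.Date call does.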
Maurits Evers
  • 1
    (+1) It's really strange that `as.Date.POSIXct` has a fixed default time zone, rather than just using the time zone attribute already present in `POSIXct` vectors. That seems to be the root of these issues. Another workaround to add to the list (as the OP already references **lubridate**) would be to use `lubridate::as_date` (rather than `as.Date`): it _does_ use the attribute as the time zone in the conversion. – Mikko Marttila Jul 06 '18 at 18:38
  • Thank you for this, it's taught me a few things. It's interesting that the defaults for `tz` are different in `as.POSIXct` and `as.Date`. Any idea why that is, or is it just "because that's the way it's always been"? – rosscova Jul 07 '18 at 00:11
  • Thanks also to @MikkoMarttila. Interesting post. I do see your point @rosscova. Please take a look at my updated post; I think it would make more sense to change the default behaviour of `as.Date.POSIXct` rather than that of `as.POSIXct`. – Maurits Evers Jul 07 '18 at 04:49
  • 1
    @MauritsEvers thanks for the update. I think you're right about the argument being circular. I'll have a better read of the discussion on Dirk's `anytime` page. – rosscova Jul 07 '18 at 22:39
  • No worries @rosscova; can I suggest a change to your title to better reflect the issue and draw more attention to it? Perhaps something like "Timezone arguments in as.POSIXct and as.Date seem inconsistent"? – Maurits Evers Jul 08 '18 at 06:01
  • @MauritsEvers I don't mind changing the title, but my question is more broad than just the date conversion issue; that was just an example I used to highlight the kinds of issues caused. Daylight Savings is another common example, and there are others. If you've got a suggestion that covers the range of issues caused by the requirement of a timezone attribute, I'll be happy to update the title. – rosscova Jul 08 '18 at 07:16
  • @rosscova *"my question is more broad than just the date conversion issue"* But wasn't the inconsistent treatment of `tz` in `as.POSIXct` and `as.Date` the key issue? It appears to me that everything else (daylight savings etc.) is a result of this. Anyway, obviously it's your call. But I think rephrasing the title and mentioning `as.Date` would help future readers find (and potentially contribute to) your post:-) – Maurits Evers Jul 09 '18 at 00:51
  • @MauritsEvers fair call. Consistency between `as.POSIXct` and `as.Date` would definitely help, and would *mostly* solve the issue, yes. What I was really hoping for though was a way not to have to use `UTC` at all, unless I choose to, even if it's a default. My current workflow is to always specify `UTC`, which works. What that is though is me "lying" about my data to avoid an idiosyncrasy of the `POSIXct` class. That's messy and "hack-y" in my opinion, so I wish there was a better way. I think at the moment there's not, so I'll try to think of a better title similar to what you've suggested. – rosscova Jul 09 '18 at 01:48
3

You basically want a different default for as.POSIXct than what is provided. You don't really want to modify anything except as.POSIXct.default, which is the function that will eventually handle character values. It wouldn't make much sense to modify as.POSIXct.numeric, since a numeric value is always an offset from the epoch in UTC. The tz argument only determines what format.POSIXct will display. So you can modify the formals list of the one you've been given. Put this in your .Rprofile:

 formals(as.POSIXct.default) <- alist(x=, ...=, tz="UTC")

Then it passes your tests (apart from the tzone attribute, which is "UTC" rather than NULL):

> x <- as.POSIXct( "2018-01-01 03:00:00" ); x
[1] "2018-01-01 03:00:00 UTC"
> attr( x, "tzone" )
[1] "UTC"
> shifted <- lubridate::force_tz( x, "UTC" )
> shifted == x
[1] TRUE
> as.numeric( shifted ) == as.numeric( x )
[1] TRUE
> as.Date( x )
[1] "2018-01-01"
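
To illustrate the point about numeric input, here is a small sketch showing that tz changes only how an instant is displayed, not the stored value. The variable names are illustrative, and it assumes the "Australia/Sydney" zone is available.

```r
secs <- 1514775600  # 2018-01-01 03:00:00 UTC, as seconds since 1970-01-01

# Same stored number, two different display zones.
in_utc    <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
in_sydney <- as.POSIXct(secs, origin = "1970-01-01", tz = "Australia/Sydney")

same_instant <- as.numeric(in_utc) == as.numeric(in_sydney)
same_instant
# [1] TRUE

shown_utc    <- format(in_utc,    format = "%Y-%m-%d %H:%M:%S")
shown_sydney <- format(in_sydney, format = "%Y-%m-%d %H:%M:%S")
shown_utc     # "2018-01-01 03:00:00"
shown_sydney  # "2018-01-01 14:00:00"
```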

The alternative would be to define an entirely new class, but that would require much more extensive efforts.
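
For what it's worth, a very rough sketch of what such a class might look like follows. Everything here is hypothetical (the naive_time name, the actual_tz attribute), and it still stores UTC internally; it only stops printing the "UTC" label and keeps the real zone as separate metadata.

```r
# Hypothetical constructor: parse wall-clock strings as UTC internally, and
# record the zone the data were really collected in as a separate attribute.
naive_time <- function(x, actual_tz = NA_character_) {
  out <- as.POSIXct(x, tz = "UTC")
  attr(out, "actual_tz") <- actual_tz
  class(out) <- c("naive_time", class(out))
  out
}

# Print the wall-clock values without the misleading "UTC" label.
print.naive_time <- function(x, ...) {
  cat("<naive_time, actual zone: ", attr(x, "actual_tz"), ">\n", sep = "")
  print(format(x, format = "%Y-%m-%d %H:%M:%S"), ...)
  invisible(x)
}

x <- naive_time("2018-01-01 03:00:00", actual_tz = "Australia/Sydney")
print(x)

shown    <- format(x, format = "%Y-%m-%d %H:%M:%S")
the_date <- format(as.Date(x))
shown     # "2018-01-01 03:00:00"
the_date  # "2018-01-01"
```

Operations such as subsetting or c() may drop the extra class or attribute, so a robust version would need its own methods for those, which is where the extra effort comes in.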

A further point to make regards the specification of time zones. With the prevalence of "daylight savings times", it might be less ambiguous to use the %z format during input (when possible) and output:

dtm <- format( Sys.time(), format="%Y-%m-%d %H:%M:%S %z")

# output
format( Sys.time(), format="%Y-%m-%d %H:%M:%S %z")
# [1] "2018-07-06 17:18:27 -0700"

# input and output without the formals change
as.POSIXct(dtm, format="%Y-%m-%d %H:%M:%S %z")
# [1] "2018-07-06 17:21:41 PDT"

# after the formals change
as.POSIXct(dtm, format="%Y-%m-%d %H:%M:%S %z")
# [1] "2018-07-07 00:21:41 UTC"

So when tz information is present as an offset, it can be handled correctly.
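
Since Sys.time() gives a different value on every run, here is the same idea with a fixed string, as a sketch. It passes tz = "UTC" explicitly rather than relying on the formals change, and the variable names are illustrative.

```r
# A timestamp that carries its own offset from UTC.
stamp <- "2018-07-06 17:18:27 -0700"

# %z reads the offset on input, so the instant is unambiguous.
parsed <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%S %z", tz = "UTC")

shown <- format(parsed, format = "%Y-%m-%d %H:%M:%S")
shown  # "2018-07-07 00:18:27"

# It is the same instant as the equivalent wall-clock time written in UTC.
same_instant <- parsed == as.POSIXct("2018-07-07 00:18:27", tz = "UTC")
same_instant
# [1] TRUE
```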

IRTFM
  • Thanks @42-, this might be the best way for me to go. One of the things I was mentioning is that while setting `tz="UTC"` seems to be the best solution, it is what I'd call a "hack" and it makes me feel like I'm lying about my data to work-around R's expectations. I'd rather not to have to do that. I'll look into creating a new class if I have some time, but I suspect that might be a stretch. Thank you. – rosscova Jul 07 '18 at 00:09