
If I query an RDBMS for a timestamp I receive: 2015-03-30 00:00:00. After I export it to a Hive table as a bigint column I get 1427673600000, and cast(ts as timestamp) gives 2015-03-30 02:00:00. That is, the current local timezone (with daylight saving time) has been applied to the timestamp.

If I want to be consistent with what we have in the DB, how should I store the exported timestamps in Hive? Do I always need to store timestamps in UTC in Hive, so that in this case I need to subtract 2 hours from what I got and then apply the current timezone during querying (using from_utc_timestamp)?
How can I take daylight saving time (GMT+1 and GMT+2) into consideration?

What is the best practice?

Bruckwald
    Past timestamps should always be stored as UTC. The exception is that if you *also* need to correlate local times, then you store the local time in a separate field, or use a `datetimeoffset` when the platform supports it. (Future times are an entirely different matter.) This is in the best practices, linked as duplicate. – Matt Johnson-Pint Oct 19 '15 at 16:19
  • Thanks for the info. Do you have any Hive related add-on / best practices? – Bruckwald Oct 19 '15 at 16:23
    Re-opening to add details for Hive. (Originally marked as dup of [Daylight saving time and time zone best practices](http://stackoverflow.com/a/2532962/634824)) – Matt Johnson-Pint Oct 19 '15 at 16:31

1 Answer


Hive stores timestamps in milliseconds since the Unix epoch. The Hive docs on timestamps are actually wrong in calling the type "timezoneless", as the Unix epoch is by definition in UTC.

The timestamp you gave (1427673600000) does indeed correspond to 2015-03-30 00:00:00 UTC. If this is the instant in time that you intended to store, then you are doing it correctly. If you were actually intending to store just a calendar date (referring to the whole date, rather than midnight UTC on that date), then you should probably use the DATE type instead, storing just 2015-03-30.
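You can verify that correspondence yourself, for example with the Java 8 java.time classes (the class name here is just for illustration):

```java
import java.time.Instant;

public class EpochCheck {
    public static void main(String[] args) {
        // The bigint value Hive gave you: milliseconds since the Unix epoch (UTC).
        long hiveMillis = 1427673600000L;

        // Instant interprets epoch milliseconds without any time zone.
        Instant instant = Instant.ofEpochMilli(hiveMillis);
        System.out.println(instant); // prints 2015-03-30T00:00:00Z
    }
}
```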

You asked why cast(ts as timestamp) gives 2015-03-30 02:00:00. What's probably happening here is that the time is being exported with the original timestamp, but when you receive it, it is getting loaded into a type that shows you the local-time equivalent.

For example, this can happen with java.util.Date. You should be able to take this value and interpret it differently, either with java.util.Calendar, or Joda-Time, or the new Java 8 java.time classes. If you're not using Java, then a similar approach would likely still apply. The point is, you're probably using Hive correctly, but the local time zone is being introduced as you view the results.
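A small java.time sketch of that effect (Europe/Budapest is only an example zone; it matches the GMT+1/GMT+2 offsets you mentioned, but substitute your actual zone):

```java
import java.time.Instant;
import java.time.ZoneId;

public class ZoneDemo {
    public static void main(String[] args) {
        Instant instant = Instant.ofEpochMilli(1427673600000L);

        // Rendering the instant in a GMT+1/GMT+2 zone shifts the wall-clock
        // time, much like a default java.util.Date rendering would.
        System.out.println(instant.atZone(ZoneId.of("Europe/Budapest"))
                                  .toLocalDateTime()); // 2015-03-30T02:00

        // Rendering it in UTC shows the value exactly as stored.
        System.out.println(instant.atZone(ZoneId.of("UTC"))
                                  .toLocalDateTime()); // 2015-03-30T00:00
    }
}
```

The instant itself never changes; only its presentation does, which is why the stored bigint is fine as-is.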

Do I need to always store timestamps as UTC format in Hive ...

Yes, that's the best practice, and that's what you're doing already.

... so in this case I need to subtract 2 hours from what I got ...

No, you should never manually add or subtract time from the timestamp. Doing so will put you at an entirely different point in time.
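One reason a fixed adjustment is dangerous: the offset itself changes with daylight saving time, so the "2 hours" you would subtract in summer would be wrong in winter. A sketch, again assuming a Central European zone as an example:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.zone.ZoneRules;

public class OffsetDemo {
    public static void main(String[] args) {
        // Example zone only; use your own zone ID here.
        ZoneRules rules = ZoneId.of("Europe/Budapest").getRules();

        // During daylight saving time the offset is +02:00 ...
        System.out.println(rules.getOffset(Instant.parse("2015-03-30T00:00:00Z"))); // +02:00

        // ... but in winter the same zone is at +01:00.
        System.out.println(rules.getOffset(Instant.parse("2015-01-15T00:00:00Z"))); // +01:00
    }
}
```

Let the time zone rules do the conversion at display time instead of baking any offset into the stored value.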

... and then I have to apply the current timezone during querying (using from_utc_timestamp) ?

I'm not completely familiar with Hive. Looking at the docs for from_utc_timestamp, it would appear this expects the input to already be in a timestamp, but they show an example using a string. Perhaps it will take an integer also, but then you would just pass UTC as the time zone, essentially doing nothing in the value conversion. It's likely that you would still have the same problem, especially if the problem is on the receiving end. IMHO, I don't think you should have to use this.

Matt Johnson-Pint