Hive stores timestamps in milliseconds since the Unix epoch. The Hive docs on timestamps are actually wrong that it is "timezoneless", as the Unix Epoch is by definition in UTC.
The timestamp you gave (1427673600000
) does indeed correspond to 2015-03-30 00:00:00 UTC
. If this is the instant in time that you intended to store, then you are doing it correctly. If actually you were intending to just store a calendar date (referring to the whole date, rather than midnight UTC on that date), then you should probably use the DATE
type instead, storing just 2015-03-30
.
You asked why cast(ts as timestamp)
gives 2015-03-30 02:00:00
. What's probably happening here, is that the time is being exported with the original timestamp, but when you receive it, it is getting loaded into a type that shows you the local time equivalent.
For example, this can happen with java.util.Date
. You should be able to take this value and interpret it differently, either with java.util.Calendar
, or Joda-Time, or the new Java 8 java.time
classes. If you're not using Java, then a similar approach would likely still apply. The point is, you're probably using Hive correctly, but the local time zone is being introduced as you view the results.
Do I need to always store timestamps as UTC format in Hive ...
Yes, that's the best practice, and that's what you're doing already.
... so in this case I need to subtract 2 hours from what I got ...
No, you should never manually add or subtract time from the timestamp. Doing so will put you at an entirely different point in time.
... and then I have to apply the current timezone during querying (using from_utc_timestamp
) ?
I'm not completely familiar with Hive. Looking at the docs for from_utc_timestamp
, it would appear this expects the input to already be in a timestamp, but they show an example using a string. Perhaps it will take an integer also, but then you would just pass UTC
as the time zone, essentially doing nothing in the value conversion. It's likely that you would still have the same problem, especially if the problem is on the receiving end. IMHO, I don't think you should have to use this.