
I have a scenario where a timestamp field from the source file is loaded into the archive in text format and into a managed table in Parquet format. The value between these two tables is off by 5 hours.

Any idea if this is something you have come across, or is this a bug in Cloudera?

Please let me know.

roh
  • Please elaborate your question a bit more; this is insufficient information to debug your problem. Most probably it's a timezone issue, but more details will help to rectify it accurately. – Rahul Feb 23 '18 at 14:55
  • @Eresh Kindly check my answer and upvote/accept, if it helps. – Gyanendra Dwivedi Mar 06 '18 at 21:34
  • I want to correct something above: "Impala will add 5 hours to the timestamp, it will treat as a local time for impala." That is incorrect. Impala does nothing to the timestamp; it reads it as written. The problem is that Hive always wants to convert the TS to UTC based on the local TZ of the machine. It's Hive that's changing the value, not Impala. Nor does Impala "assume UTC"; Impala simply reads the value as written. For example, if you write a TS with a time 08-24-2018 11:16:00, Hive assumes the local timezone based on the machine, then converts it to UTC and writes it. Impala just writes ( – John Howey Aug 24 '18 at 15:24

2 Answers


As per the Impala documentation here, Impala by default assumes the UTC timezone for the data.

Because Impala does not assume that TIMESTAMP values are in any particular time zone, you must be conscious of the time zone aspects of data that you query, insert, or convert.

For consistency with Unix system calls, the TIMESTAMP returned by the now() function represents the local time in the system time zone, rather than in UTC. To store values relative to the current time in a portable way, convert any now() return values using the to_utc_timestamp() function first.

When working with Hive, you may want to follow what the documentation suggests, but please note that there is a performance overhead with that solution. To avoid the overhead, I suggest you save the Hive data in the UTC timezone (if possible).
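To see where a shift like the 5 hours comes from, here is a minimal Python sketch of the write/read asymmetry described above. The America/New_York zone (UTC-5 in winter) is an assumption for illustration; your cluster's local zone may differ:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# A naive timestamp as it appears in the source file (no timezone info).
naive = datetime(2018, 2, 23, 11, 16, 0)

# Hive treats the naive value as local time (assumed here: America/New_York,
# UTC-5 in February) and stores the UTC equivalent in the Parquet file.
local = naive.replace(tzinfo=ZoneInfo("America/New_York"))
stored_utc = local.astimezone(ZoneInfo("UTC"))

# Impala reads the stored value as written, without converting back,
# so the value it returns differs from the original by the UTC offset.
shift_hours = (stored_utc.replace(tzinfo=None) - naive).total_seconds() / 3600
print(shift_hours)  # 5.0
```

The same data read through Hive comes back correct, because Hive applies the reverse conversion on read; only the as-written read path shows the offset.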

If you have data files written by Hive, those TIMESTAMP values represent the local timezone of the host where the data was written, potentially leading to inconsistent results when processed by Impala. To avoid compatibility problems or having to code workarounds, you can specify one or both of these impalad startup flags:

-use_local_tz_for_unix_timestamp_conversions=true

-convert_legacy_hive_parquet_utc_timestamps=true

Although -convert_legacy_hive_parquet_utc_timestamps is turned off by default to avoid performance overhead, where practical turn it on when processing TIMESTAMP columns in Parquet files written by Hive, to avoid unexpected behavior.
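As a sketch of how those startup flags might be applied on a package-based install, the snippet below appends them to `IMPALA_SERVER_ARGS`. The `/etc/default/impala` path is an assumption from package installs; on a Cloudera Manager deployment you would instead add the flags through the impalad command-line argument safety valve:

```shell
# /etc/default/impala  (assumed path for package-based installs;
# on Cloudera Manager, use the impalad safety valve instead)
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
    -use_local_tz_for_unix_timestamp_conversions=true \
    -convert_legacy_hive_parquet_utc_timestamps=true"
```

Restart the impalad daemons afterwards for the flags to take effect.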

Gyanendra Dwivedi

This is an expected behaviour when you use timestamps in Hive; you have to set convert_legacy_hive_parquet_utc_timestamps globally.

Impala will add 5 hours to the timestamp, treating it as local time for Impala. The easiest solution is to change the field type to string, or to subtract 5 hours while inserting in Hive.
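The "subtract 5 hours" workaround mentioned above can be sketched in Python. The fixed 5-hour offset is an assumption (it matches a UTC-5 zone with no DST adjustment; during DST the offset would change), so this is an illustration rather than a robust fix:

```python
from datetime import datetime, timedelta

# Assumed fixed offset for a UTC-5 local zone; DST would change this.
OFFSET = timedelta(hours=5)

def compensate(ts: datetime) -> datetime:
    """Shift the value back by the offset Hive will add on write,
    so the as-written value read by Impala matches the source."""
    return ts - OFFSET

original = datetime(2018, 2, 23, 11, 16, 0)
print(compensate(original))  # 2018-02-23 06:16:00
```

Note the comment under the question disputing which engine does the conversion; either way, converting the column to string (or storing true UTC) sidesteps the offset entirely.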

There are a lot of questions on this already; check them out.

roh