As per the Impala documentation here, Impala assumes by default that the data is in the UTC time zone.
Because Impala does not assume that TIMESTAMP values are in any
particular time zone, you must be conscious of the time zone aspects
of data that you query, insert, or convert.
For consistency with Unix system calls, the TIMESTAMP returned by the
now() function represents the local time in the system time zone,
rather than in UTC. To store values relative to the current time in a
portable way, convert any now() return values using the
to_utc_timestamp() function first.
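To make the conversion above concrete, here is a minimal Python sketch of the arithmetic involved. It is only an analogy for what to_utc_timestamp() does to a now() value, not Impala's implementation. The helper name to_utc_naive and the fixed UTC-8 offset are assumptions chosen for illustration; a real named time zone would also account for daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical system time zone: a fixed UTC-8 offset (an assumption for this sketch).
LOCAL_TZ = timezone(timedelta(hours=-8))


def to_utc_naive(local_value: datetime, local_tz: timezone = LOCAL_TZ) -> datetime:
    """Interpret a zone-less wall-clock value as local_tz and return the
    equivalent zone-less UTC value."""
    aware_local = local_value.replace(tzinfo=local_tz)   # attach the local zone
    aware_utc = aware_local.astimezone(timezone.utc)     # shift to UTC
    return aware_utc.replace(tzinfo=None)                # drop the zone again


# A local wall-clock reading, as a now()-style call would return it.
local_now = datetime(2020, 1, 15, 12, 0, 0)
utc_value = to_utc_naive(local_now)
print(utc_value)
```

Storing utc_value rather than local_now means the stored value does not depend on which host produced it.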
When working with Hive, you may want to follow what the documentation suggests, but please note that this solution carries a performance overhead. To avoid that overhead, I suggest storing the Hive date values in the UTC time zone, if possible.
If you have data files written by Hive, those TIMESTAMP values
represent the local timezone of the host where the data was written,
potentially leading to inconsistent results when processed by Impala.
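The kind of inconsistency described above can be sketched in Python. This is an illustration under stated assumptions, not a reproduction of Hive or Impala internals: the writer host is assumed to sit at a fixed UTC-8 offset, and the reader is assumed to treat the stored zone-less value as UTC. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical time zone of the host that wrote the data (fixed UTC-8 offset).
WRITER_TZ = timezone(timedelta(hours=-8))


def written_local_value(instant_utc: datetime) -> datetime:
    """Zone-less wall-clock value a writer in WRITER_TZ records for an instant."""
    return instant_utc.astimezone(WRITER_TZ).replace(tzinfo=None)


def read_as_utc(stored_value: datetime) -> datetime:
    """Reader that assumes the stored zone-less value is already in UTC."""
    return stored_value.replace(tzinfo=timezone.utc)


event = datetime(2020, 1, 15, 20, 0, 0, tzinfo=timezone.utc)  # the real instant
stored = written_local_value(event)   # what ends up in the data file
seen = read_as_utc(stored)            # what the reader believes happened
skew = event - seen                   # equals the writer's UTC offset
print(stored, seen, skew)
```

The skew equals the writer host's offset from UTC, which is why writing the values in UTC in the first place removes the discrepancy.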
To avoid compatibility problems or having to code workarounds, you can
specify one or both of these impalad startup flags:
-use_local_tz_for_unix_timestamp_conversions=true
-convert_legacy_hive_parquet_utc_timestamps=true
Although -convert_legacy_hive_parquet_utc_timestamps is turned off by
default to avoid performance overhead, where practical turn it on when
processing TIMESTAMP columns in Parquet files written by Hive, to
avoid unexpected behavior.