1

I am facing a very weird issue in hive in production environment(cloudera 5.5) which is basically not reproducible in my local server(Don't know why) i.e. for some records I am having wrong timestamp value while inserting from temp table to main table as String "2017-10-21 23" is converted into timestamp "2017-10-21 23:00:00" datatype while insertion.

example::

2017-10-21 23 -> 2017-10-21 22:00:00
2017-10-22 15 -> 2017-10-22 14:00:00

It is happening very very infrequent. Means delta value is about 1% of the data.

Flow::: Data in temp table(External table) is populated hourly by using oozie. Below insert statement is executed hourly to insert from temp table to main table(internal table in ORC) in hive using Oozie workflow.

Flow summary::: Linux logs >> copy logs in temp table(external hive table) >> insert in main hive table.

Insert from temp table to main table:::

FROM
 temp
 INSERT INTO TABLE
 main
 PARTITION(event_date,service_id)
 SELECT
 from_unixtime(unix_timestamp(event_timestamp ,'yyyy-MM-dd HH'), 'yyyy-MM-dd HH:00:00.0'),
 col3,
 col4,
 "" as partner_nw_name,
 col5,
 "" as country_name,
 col6,
 col7,
 col8,
 col9,
 col10,
 col11,
 col12,
 col13,
 col14,
 col15,
 kpi_id,
 col18,
 col19,
 col20,
 col21,
 col23,
 col24,
 col25,
 from_unixtime(unix_timestamp(event_timestamp ,'yyyy-MM-dd HH'), 'yyyy-MM-dd') as event_date,
 service_id;

Temp Table:::

hive> desc temp;
OK
event_timestamp string
col2 int
col3 int
col4 int
col5 int
col6 string
col7 string
col8 string
col9 string
col10 string
col11 int
col12 int
col13 string
col14 string
col15 string
service_id int
kpi_id int
col18 bigint
col19 bigint
col20 bigint
col21 bigint
col22 double
col23 string
col24 int
col25 int
Time taken: 0.165 seconds, Fetched: 25 row(s)

Main Table:::

hive> desc main;
OK
event_timestamp timestamp
col3 int
col4 int
partner_nw_name string
col5 int
country_name string
col6 string
col7 string
col8 string
col9 string
col10 int
col11 int
col12 int
col13 string
col14 string
col15 string
kpi_id int
col18 bigint
col19 bigint
col20 bigint
col21 bigint
col23 double
col24 int
col25 int
event_date date
service_id int

# Partition Information
# col_name data_type comment

event_date date
service_id int
Time taken: 0.175 seconds, Fetched: 32 row(s)
Harneet Singh
  • 2,328
  • 3
  • 16
  • 23

2 Answers2

0

Seems like you are adding extra 00 for the hrs place too..

try this:

select from_unixtime(unix_timestamp('2017-08-29 05','yyyy-MM-dd HH'),'yyyy-MM-dd HH:00:0');

the above query gives:

2017-10-21 23:00:0

is this what you are expecting? you can add 'yyyy-MM-dd HH:00:00.0' if needed.

sk7979
  • 140
  • 2
  • 18
  • I am expecting `2017-10-21 23 -> 2017-10-21 23:00:00`. It gives right answer at most. But sometimes it changes the values of hour as show in example. – Harneet Singh Nov 06 '17 at 12:55
  • you are saying that the source is a string right .. String "2017-10-21 23" .. there might be some extra characters getting appended tot he string.. for example "('2017-10-21 -23','yyyy-MM-dd HH')" gives "2017-10-20 01:00:00.0".. please do check the data through hue or copy the file to local and check. – sk7979 Nov 06 '17 at 13:27
  • I Checked that data. The data was right and then I executed the flow on local with same data and in local it is giving me right output. So that's why am unable to troubleshoot the issue. – Harneet Singh Nov 06 '17 at 13:35
  • How many nodes are there in the cluster? For me it sounds like related to a specific data nodes which might not have the same timezone/settings as all others. Can you check if there are all in sync? And if the issue related to code execution on a specific node? – U880D Jan 02 '18 at 16:48
0

If you are writing your data in parquet format using Hive then hive adjust the timestamp by local timezone offset. For more information please go through the below links.

There is a Jira ticket related to that for Impala #2716

Cloudera Impala Timestamp document is here

chandu kavar
  • 411
  • 1
  • 4
  • 13
  • That was very nice. But my problem is different. I am saying sometimes it shows different time not always. It is inconsistent behaviour as I said `It is happening very very infrequent. Means delta value is about 1% of the data.` in my question. And yes, I am using ORC file format and using only hive queries not impala. – Harneet Singh Nov 06 '17 at 18:17