Calculate time difference between two columns of string type in hive without changing the data type string

Question

I am trying to calculate the time difference between two columns of a row which are of string data type. If the time difference between them is less than 2 hours then select the first column of that row else if the time difference is greater than 2 hours then select the second column of that row. It can be done by converting the columns to datetime format, but I want the result to be in string only. How can I do that? The data looks like this:

col1(string type)
2018-07-16 02:23:00
2018-07-26 12:26:00
2018-07-26 15:32:00

col2(string type)
2018-07-16 02:36:00
2018-07-26 14:29:00
2018-07-27 15:38:00

leftjoin · Answer 1 · 2018-07-30T18:52:43.593

1

Use unix_timestamp() to convert string timestamp to seconds. The difference in hours will be:

hive> select  (unix_timestamp('2018-07-16 02:23:00')- unix_timestamp('2018-07-16 02:36:00'))/60/60;
OK
-0.21666666666666667

Important update: this method will work correctly only if time zone is configured as UTC. Because for DST timezones for some marginal cases Hive converts time during timestamp operations. Consider this example for PDT time zone:

hive> select hour('2018-03-11 02:00:00'); 
OK 
3

Note the hour is 3, not 2. This is because 2018-03-11 02:00:00 cannot exist in PDT time zone because exactly at 2018-03-11 02:00:00 time is adjusted and becomes 2018-03-11 03:00:00. The same happens when converting to unix_timestamp. For PDT time zone unix_timestamp('2018-03-11 03:00:00') and unix_timestamp('2018-03-11 02:00:00') will return the same timestamp:

hive> select unix_timestamp('2018-03-11 03:00:00');
OK
1520762400
hive> select unix_timestamp('2018-03-11 02:00:00');
OK
1520762400

And few links for your reference:

https://community.hortonworks.com/questions/82511/change-default-timezone-for-hive.html

http://boristyukin.com/watch-out-for-timezones-with-sqoop-hive-impala-and-spark-2/

Also have a look at this jira please: Hive should carry out timestamp computations in UTC

edited Jul 30 '18 at 18:52

answered Jul 30 '18 at 10:16

leftjoin

36,950
8
57
116

I've tried this method but this does not work correctly for all the time stamps. For example: SELECT (unix_timestamp('2018-03-11 03:52:00') - unix_timestamp('2018-03-11 02:00:00'))/60; for this it shows the incorrect result. it does not convert the hour part correctly. If you run this: select unix_timestamp('2018-03-11 03:00:00')/60/60, unix_timestamp('2018-03-11 02:00:00')/60/60; both the results are same. – Khushboo Jul 30 '18 at 10:41
@Khushboo Confirm, I have checked and observing the same – leftjoin Jul 30 '18 at 11:25
@Khushboo It seems only extraction of hours, minutes, seconds separately and converting date only to seconds works as expected. See updated answer please – leftjoin Jul 30 '18 at 11:59
Please check this, I tried it and it is still giving me time difference 0. it should give me 3600 seconds as the time difference. Here I have made the minute part as 00, so that the difference can be easily noticed. select ((unix_timestamp(to_date('2018-03-11 03:00:00'),'yyyy-MM-dd')+hour('2018-03-11 03:00:00')*60*60+minute('2018-03-11 03:00:00')*60+second('2018-03-11 03:00:00')) - (unix_timestamp(to_date('2018-03-11 02:00:00'),'yyyy-MM-dd')+hour('2018-03-11 02:00:00')*60*60+minute('2018-03-11 02:00:00')*60+second('2018-03-11 02:00:00')))/60/60; – Khushboo Jul 30 '18 at 12:38
@Khushboo The problem is this: hive> select hour('2018-03-11 02:00:00'); OK 3 But it should be 2!!! – leftjoin Jul 30 '18 at 13:20
@Khushboo Finally have found what is the reason: https://www.timeanddate.com/time/change/usa?year=2018 This hour does not exist in PST bacause it is time change and it becomes 03 hrs, skipping 02 hrs. So, it is converting non-existing time. – leftjoin Jul 30 '18 at 13:43
@Khushboo unix_timestamp using the default timezone and the default locale. If timezone is UTC it should work good. And the timezone is configurable on cluster. Tried this: https://community.hortonworks.com/questions/82511/change-default-timezone-for-hive.html Setting it in hive session does not work. – leftjoin Jul 30 '18 at 14:23
yes it is happening due to daylight saving time only. I also found something that is worth reading: http://boristyukin.com/watch-out-for-timezones-with-sqoop-hive-impala-and-spark-2/ – Khushboo Jul 30 '18 at 15:32
@Khushboo I decided to keep this just for reference with explanation what is happening. – leftjoin Jul 30 '18 at 18:21

score 1 · Accepted Answer · answered Jul 30 '18 at 10:53

1

I think you don't need to convert the columns to datetime format, since the data in your case is already ordered (yyyy-MM-dd hh:mm:ss). You just need to take all the digits and take it into one string (yyyyMMddhhmmss) then you can apply your selection which is bigger or smaller than 2 hours (here 20000 since the hour is followed by mmss). By looking at your example (assuming col2 > col1), this query would work:

SELECT case when regexp_replace(col2,'[^0-9]', '')-regexp_replace(col1,'[^0-9]', '') < 20000 then col1 else col2 end as col3 from your_table;

answered Jul 30 '18 at 10:53

Agung Sriwongo

133
6

Can you give me some scenario when it failed ? – Agung Sriwongo Jul 30 '18 at 11:23
till now the solution that you have given is working fine for all the scenarios that i have tested. Thanks :). I''l let you know if i come across any test case which fails – Khushboo Jul 30 '18 at 13:01

Calculate time difference between two columns of string type in hive without changing the data type string

2 Answers2

Linked