Hive giving different results for same query and same data

Question

I have following data in my table.

Running below query through script and giving me different results every time when I execute. For example rptind for id 2, some times it gives 'P' and some times its 'n'. Not sure why it is changing the value for this rptind.

if I execute manually, I see same results always.

select id,eff_dt,ind,rptind,row_number() over 
(partition by id order by(case when ind="ok" then 1 when ind="no" then 2 when ind="not" then 3 end)) 
as rnk from tmp1 order by id;

Not understanding why it is behaving like this. Can someone suggest what might be the cause of this issue.

Thank you, Babu

Non-deterministic behavior explained in this answer: https://stackoverflow.com/a/55909947/2700344 — leftjoin, Dec 13 '20 at 08:23

forpas · Accepted Answer · 2020-12-12T07:55:26.180

...some times it gives 'P' and some times its 'n'

The order that you defined by:

case when ind="ok" then 1 when ind="no" then 2 when ind="not" then 3 end

will sort the rows by ind, but if there are 2 rows for the same id with ind="ok" and different rptinds 'p' and 'n', you did not define which one should be first.

So expand your ordering conditions by defining the sort order for rptind too, like:

select id, eff_dt, ind, rptind,
       row_number() over (
         partition by id 
         order by case ind when 'ok' then 1 when 'no' then 2 when 'not' then 3 end,
                  case rptind when 'p' then 1 when 'n' then 2 end 
       ) as rnk 
from tmp1 
order by id;

You can expand the sort order further, if needed, to include other columns.

score 1 · Answer 2 · answered Dec 12 '20 at 13:37

What value do you want? SQL tables represent unordered sets. You have ties in the order by conditions, so row_number() can -- and does -- enumerate those ties differently. For example:

id   eff_dt       ind   rpt_ind   
 2   2020-07-20    ok      n
 2   2020-07-21    ok      p

The partitioning is on id so both these rows are in a partition. The order by is based on ind and both have the same value. Which is chosen on a particular run is arbitrary. Remember: SQL tables represent unordered sets so there is no ordering to fall back on.

You haven't specified the logic that you actually want. The safest method is to include a unique identifier as the last key in the order by. Usually, such a key is id, but that is clearly not the case here. Perhaps eff_dt along with id is unique (as in the above example):

row_number() over (partition by id
                   order by (case when ind = 'ok' then 1 when ind = 'no' then 2 when ind = 'not' then 3 end),
                            eff_dt desc
                  )

As a final recourse, you could just use rptind, so you prioritize those values in the row_number().

Hive giving different results for same query and same data

2 Answers2