1

I want to update the existing data if it exists based on some condition(data with higher priority should be updated) and insert new data if not exists.

I have already written a query for this but somehow it is duplicating the number of rows. Here is the full explanation of what I have and what I want to achieve:

What I have: Table 1 - columns - id,info,priority

hive> select * from sample1;
OK
1   123     1.01
2   234     1.02
3   213     1.03
5   213423  1.32
Time taken: 1.217 seconds, Fetched: 4 row(s)

Table 2: columns - id,info,priority

hive> select * from sample2;
OK
1   1234    1.05
2   23412   1.01
3   21      1.05
4   1232    1.1
2   3432423 1.6
3   34324   1.4

What I want is the final table should have only 1 row per id with the data according to the greatest priority:

1   1234    1.05
2   3432423 1.6
3   34324   1.4
4   1232    1.1
5   213423  1.32

The query that I have written is this:

insert overwrite table sample1
select a.id,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from sample1 a
join 
sample2 b
on a.id=b.id where b.id in (select distinct(id) from sample1)
union all
select * from sample2 where id not in (select distinct(id) from sample1)
union all
select * from sample1 where id not in (select distinct(id) from sample2);

After running this query, I am getting this result:

hive> select * from sample1;
OK
1   1234    1.05
2   234     1.02
3   21      1.05
2   3432423 1.6
3   34324   1.4
5   213423  1.32
4   1232    1.1

How do I modify the present query to achieve the correct result. Is there any other method/process that I can follow to achieve the end result. I am using hadoop 2.5.2 along with HIVE 1.2.1 . I am working on a 6 node cluster with 5 slaves and 1 NN.

Rishabh Dixit
  • 115
  • 4
  • 16

3 Answers3

0

Use FULL JOIN, it will return all joined rows plus all not joined rows from the left and all not joined rows from the right tables. sample2 table contains duplicated rows per id, this is why join duplicates rows, use row_number() analytic function to select only rows with highest priority from sample2 table:

insert overwrite table sample1
select 
      nvl(a.id, b.id) as id,
      case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.info else b.info end as info,
      case when cast(TRIM(a.prio) as double) > cast(TRIM(b.prio) as double) then a.prio else b.prio end as prio
from ( select a.*, row_number() over (partition by id order by prio desc) rn 
         from sample1 a
     ) a
     full join 
          ( select b.*, row_number() over (partition by id order by prio desc) rn
              from sample2 b
          ) b on a.id=b.id and b.rn=1 --join only with highest priority rows
where a.rn=1;

If sample1 table also contains multiple rows per id (it is not in your example), apply the same technique using row_number to the table sample1.

See also the answer about merge using full join: https://stackoverflow.com/a/37744071/2700344

Also as of Hive 2.2 you can use ACID Merge, see examples

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thank you. I will test this and let you know. – Rishabh Dixit Oct 26 '18 at 06:40
  • I have updated my question with a more accurate situation that I am facing currently. Any help would be appreciated. I sincerely thank you for your support till now. – Rishabh Dixit Oct 28 '18 at 08:04
  • Sir, I am getting multiple rows even after applying row_number function as mentioned by you above. you can check the output here : https://pastebin.com/vy9hjPjq – Rishabh Dixit Oct 28 '18 at 10:05
  • And initial data is the same as you have posted? – leftjoin Oct 28 '18 at 10:47
  • What if I truncate sample 1 and insert everything from scratch using sample 2 ? I have multiple priority attributes for a single row of data and each priority attribute is responsible for 4 different attributes. I need to update these 4 attributes per priority value for each ID. – Rishabh Dixit Oct 28 '18 at 10:48
  • @RishabhDixit added filtering by rn for the first table – leftjoin Oct 28 '18 at 11:06
  • This is the output after the modified query : https://pastebin.com/rqHPAyFJ . After this I also removed all the data from sample1, and executed the query. After the query execution, sample1 remained empty. – Rishabh Dixit Oct 28 '18 at 11:22
0

Since I had multiple id rows for each id, so I firstly consolidated the IDs using a spark script. The solution could be found here : SPARK 2.2.2 - Joining multiple RDDs giving out of memory excepton. Resulting RDD has 124 columns. What should be the optimal joining method? Then I used the query mentioned in the question to get the desired result.

Rishabh Dixit
  • 115
  • 4
  • 16
0

adding to previously good answers! try this also:

insert overwrite table UDB.SAMPLE1
select 
 COALESCE(id2,id )
,COALESCE(info2,info)
,COALESCE(priority2, priority)
from 
UDB.SAMPLE1 TAB1
full outer JOIN
(
select id2, info2, priority2
from
(
select 
 id       as id2
,info     as info2
,priority as priority2
,row_number() over (partition by id order by priority desc) rn
from UDB.SAMPLE2
)TAB2_wt
where TAB2_wt.rn =1
)TAB2
on TAB2.id2 = TAB1.id
;

select * from SAMPLE1;

+-----+----------+-----------+--+
| id  |   info   | priority  |
+-----+----------+-----------+--+
| 1   | 1234     | 1.05      |
| 2   | 3432423  | 1.6       |
| 3   | 34324    | 1.4       |
| 4   | 1232     | 1.1       |
| 5   | 213423   | 1.32      |
+-----+----------+-----------+--+
vikrant rana
  • 4,509
  • 6
  • 32
  • 72
  • Your solution seems good for this data but for multiple priority and ID columns, I doubt that this will work. Please take sample data from [this question](https://stackoverflow.com/questions/53120335/spark-2-2-2-joining-multiple-rdds-giving-out-of-memory-excepton-resulting-rdd). I doubt that we can do this without using window functions. – Rishabh Dixit Dec 07 '18 at 09:24