0

I have a hive external table with data say, (version less than 0.14)

+--------+------+------+------+
|  id   |   A  |   B  |   C  |
+--------+------+------+------+
| 10011 |   10 |    3 |    0 |
| 10012 |    9 |    0 |   40 |
| 10015 |   10 |    3 |    0 |
| 10017 |    9 |    0 |   40 |
+--------+------+------+------+

And I have a delta file having data given below.

+--------+------+------+------+
|  id   |   A  |   B  |   C  |
+--------+------+------+------+
| 10012 |   50 |    3 |   10 |      --> update
| 10013 |   29 |    0 |   40 |      --> insert
| 10014 |   10 |    3 |    0 |      --> update
| 10013 |   19 |    0 |   40 |      --> update
| 10015 |   70 |    3 |    0 |      --> update
| 10016 |   17 |    0 |   40 |      --> insert
+--------+------+------+------+

How can I update my hive table with the delta file, without using sqoop. Any help on how to proceed will be great! Thanks.

1689
  • 117
  • 1
  • 10
  • 1
    Possible duplicate of [Hive: Best way to do incremetal updates on a main table](https://stackoverflow.com/questions/37709411/hive-best-way-to-do-incremetal-updates-on-a-main-table) – leftjoin Jul 22 '19 at 10:09

2 Answers2

1

This is because there is duplicates in the file. How do you know which you should keep? The last one? In that case you can use, for example, the row_number and then get the maximum value. Something like that.

SELECT coalesce(tmp.id,initial.id) as id, 
 coalesce(tmp.A, initial.A) as A, 
 coalesce(tmp.B,initial.B) as B,
 coalesce(tmp.C, initial.C) as C
FROM
    table_a initial
FULL OUTER JOIN 
( SELECT *, row_number() over( partition by id ) as row_num
  ,COUNT(*) OVER (PARTITION BY id) AS cnt
 FROM temp_table 
) tmp
ON initial.id=tmp.id
WHERE  row_num=cnt
OR row_num IS NULL;

Output:

+--------+-----+----+-----+--+
|   id   |  a  | b  |  c  |
+--------+-----+----+-----+--+
| 10011  | 10  | 3  | 0   |
| 10012  | 50  | 3  | 10  |
| 10013  | 19  | 0  | 40  |
| 10014  | 10  | 3  | 0   |
| 10015  | 70  | 3  | 0   |
| 10016  | 17  | 0  | 40  |
| 10017  | 9   | 0  | 40  |
+--------+-----+----+-----+--+
0

You can load the file to a temporary table in hive and then execute a FULL OUTER JOIN between the two tables.

Query Example:

SELECT coalesce(tmp.id,initial.id) as id, 
 coalesce(tmp.A, initial.A) as A, 
 coalesce(tmp.B,initial.B) as B,
 coalesce(tmp.C, initial.C) as C
FROM
    table_a initial
FULL OUTER JOIN
   temp_table tmp on initial.id=tmp.id;

Output

+--------+-----+----+-----+--+
|   id   |  a  | b  |  c  |
+--------+-----+----+-----+--+
| 10011  | 10  | 3  | 0   |
| 10012  | 50  | 3  | 10  |
| 10013  | 29  | 0  | 40  |
| 10013  | 19  | 0  | 40  |
| 10014  | 10  | 3  | 0   |
| 10015  | 70  | 3  | 0   |
| 10016  | 17  | 0  | 40  |
| 10017  | 9   | 0  | 40  |
+--------+-----+----+-----+--+
  • Thank you! but here id 10013 comes 2 times, I only need the latest value. Anyway to get that? – 1689 Jul 22 '19 at 16:10