0

I have a production table in hive which gets incremental(changed records/new records) data from external source on daily basis. For values in row are possibly spread across different dates, for example, this is how records in table looks on first day

+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  a1|  b1|
|  2|  a2|    |
|  3|    |  b3|
+---+----+----+

on second day, we get following -

+---+----+----+
| id|col1|col2|
+---+----+----+
|  4|  a4|    |
|  2|    | b2 |
|  3|  a3|    |
+---+----+----+
which has new record as well as changed records

The result I want to achieve is, merge of rows based on Primary key (id in this case) and produce and output which is -

+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  a1|  b1|
|  2|  a2| b2 |
|  3|  a3|  b3|
|  4|  a4|  b4|
+---+----+----+

Number of columns are pretty huge , typically in range of 100-150. Aim is to provide latest full view of all the data received so far.How can I do this within hive itself. (ps:it doesnt have to be sorted)

Pushkr
  • 3,591
  • 18
  • 31
  • I don't understand what the data looks like. Where does the first day's data go on the second day? – Gordon Linoff May 17 '18 at 02:33
  • Sorry, I should have explained better. The data from first is remains in table and the second day data is received as csv file and I have to load it in existing hive table in such a way to build the final result looks like output above – Pushkr May 17 '18 at 02:36
  • 1
    Using `FULL JOIN`: https://stackoverflow.com/a/37744071/2700344 – leftjoin May 17 '18 at 06:30
  • @leftjoin Thank you, this indeed worked – Pushkr May 18 '18 at 14:53
  • Empty strings in PK columns (I mean join key) better to convert to NULLs before join, it guarantees they WILL NOT join and create a skew and other side effects like duplication after join. – leftjoin May 18 '18 at 18:22

1 Answers1

0

This can archived using COALESCE and full outer join.

        SELECT COALESCE(a.id ,b.id) as id ,
            COALESCE(a.col1 ,b.col1) as  col1,
            COALESCE(a.col2 ,b.col2) as col2 
        FROM tbl1 a 
        FULL OUTER JOIN table2 b
           on a.id =b.id  
sandeep rawat
  • 4,797
  • 1
  • 18
  • 36
  • This is good solution but unfortunately it doesnt work with hive well because hive treats null and empty spaces differently. So COALESCE fails when there is a string column. – Pushkr May 18 '18 at 14:52
  • trim the string ie trim(col1) and male them null using TBLPROPERTIES('serialization.null.format'='') https://stackoverflow.com/questions/43263038/empty-string-is-not-treated-as-null-in-hive?rq=1 or use case statement in to check null or blank string. – sandeep rawat May 18 '18 at 17:31