0

I am new to Hive .I have a main table t1 and temp table t2. Temp table gets uploaded by incoming data everyday. Incoming data can be updated records as well as new records.

table t2(temp):               table t1(main)
id   name                      id      name
1   vinni                      1       vikki
3   anna                       2       amita

I want my main table to have old records, updated records from temp table and new records from temp table.

My main table should have records like below:

id  name
1   vinni
2   amita
3   anna

I tried doing it by full outer join but its not the optimal solution. So how can I achieve this by using left outer join. At the end I do not want my temp table records and the table can be deleted after its data gets loaded to main table.

hairbo
  • 3,113
  • 2
  • 27
  • 34
  • 1
    Possible duplicate of [Hive: Best way to do incremetal updates on a main table](https://stackoverflow.com/questions/37709411/hive-best-way-to-do-incremetal-updates-on-a-main-table) – leftjoin Mar 13 '19 at 12:36
  • 1
    Only FULL JOIN or UNION ALL+row_number are possible solutions like in this answer: https://stackoverflow.com/a/37744071/2700344. FULL JOIN because you need not joined rows from both tables like in your data example, id=2 and id=3. UNION ALL+row_number option may be faster:https://stackoverflow.com/a/44755825/2700344 – leftjoin Mar 13 '19 at 12:45

1 Answers1

0

You can do it with union of two tables then select updated records for already existed ids, e.g.

with union_table as (
  select *, 0 as new_flag from t1
  union all
  select *, 1 as new_flag from t2
),
stage_table as (
  select *, max(new_flag) over (partition by id) as max_flag
    from union_table
),
stage_table2 as (
  select id, name, new_flag
    from stage_table
   where new_flag = max_flag
)
select id, name
  from stage_table2

Windowing max will help you to find out if a particular id needs to be replaced with updated value

serge_k
  • 1,772
  • 2
  • 15
  • 21