I have a user table in Hive of the form:

User: 
Id    String,
Name  String,
Col1  String,
UpdateTimestamp Timestamp

I'm inserting data into this table from a file which has the following format:

I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp

e.g. for inserting a user with Id 1:

I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456

and updating col1 for the same user with Id 1:

U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457

The columns which are not updated are returned as null.

Now simple insertion is easy in Hive: LOAD DATA INPATH into a staging table, then insert into the final table while ignoring the first two fields from the staging table.
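For the insert-only path, a minimal sketch of what I have in mind (the table names and the HDFS path are placeholders):

create table user_staging (
  op               string,    -- I/U flag
  file_ts          string,    -- timestamp when the record was written to the file
  id               string,
  name             string,
  col1             string,
  updatetimestamp  string
)
row format delimited fields terminated by ','
stored as textfile;

load data inpath '/path/to/incoming/file' into table user_staging;

-- plain inserts: drop the first two fields
-- (backticks because user is a reserved word in recent Hive versions)
insert into table `user`
select id, name, col1, cast(updatetimestamp as timestamp)
from user_staging
where op = 'I';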

However, how would I go about the update records, so that my final row in Hive looks like this:

1,Bob,updatedstuff,123457

I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?

  • I'm lost. You mention a table with 4 columns but you have more columns in the sample data. – Gordon Linoff Aug 21 '19 at 11:57
  • The first column in the file indicates whether the record is an insert or an update, while the second column is just the timestamp when the record was written to the file. – orak Aug 21 '19 at 11:59
  • Possible duplicate of [Hive: Best way to do incremetal updates on a main table](https://stackoverflow.com/questions/37709411/hive-best-way-to-do-incremetal-updates-on-a-main-table) – leftjoin Aug 21 '19 at 12:00

2 Answers


Typically with a merge statement, your "file" would still be unique on id, and the merge would determine whether each record needs to be inserted as new or used to update the existing row.

However, if the file format is non-negotiable and will always have the I/U flag, you could break the process into two steps, the inserts and then the updates, as you suggested.

In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:

create table test.orc_acid_example_users
(
  id int
  ,name string
  ,col1 string
  ,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
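Depending on your Hive version, ACID also has to be switched on at the session/cluster level. The settings below are the commonly cited ones, but treat them as a starting point and check your distribution's documentation:

SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.enforce.bucketing = true;        -- only needed before Hive 2.0
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.compactor.initiator.on = true;   -- metastore side
SET hive.compactor.worker.threads = 1;    -- metastore side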

After your insert statements, your Bob record would show "stuff" in col1.
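For reference, the insert step could look something like this, assuming the staging table test.orc_acid_example_staging used in the merge below, with string columns and the I/U flag in a column named type (the casts are assumptions about the staging column types):

insert into table test.orc_acid_example_users
select cast(s.id as int), s.name, s.col1, cast(s.updatetimestamp as timestamp)
from test.orc_acid_example_staging s
where s.type = 'I';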

As far as the updates go, you could tackle them with an update or merge statement. I think the key here is the null values: it's important to keep the original name, col1, or whatever, when the staging row from the file has a null. Here's a merge example which coalesces the staging table's fields; in short, if there is a value in the staging table, take it, otherwise fall back to the original value.

merge into test.orc_acid_example_users as t
  using test.orc_acid_example_staging as s
on t.id = s.id
  and s.type = 'U'   -- only apply update records
when matched
  then update set name = coalesce(s.name, t.name),
                  col1 = coalesce(s.col1, t.col1),
                  updatetimestamp = coalesce(s.updatetimestamp, t.updatetimestamp);

Now Bob will show "updatedstuff" in col1.

Quick disclaimer: if you have more than one update for Bob in the staging table, things will get messy. You will need a pre-processing step that collapses the updates to the latest non-null values per id before doing the update/merge (see the sketch below). Hive isn't really a full transactional DB, so it would be preferable for the source to send complete user records whenever there's an update, instead of only the changed fields.
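A sketch of that pre-processing step, collapsing multiple 'U' rows per id down to the latest non-null value of each column (the window frame and the staging column names are assumptions):

create temporary table staging_collapsed as
select id, name, col1, updatetimestamp
from (
  select id,
         -- latest non-null value across all updates for this id
         last_value(name, true) over (partition by id order by updatetimestamp
             rows between unbounded preceding and unbounded following) as name,
         last_value(col1, true) over (partition by id order by updatetimestamp
             rows between unbounded preceding and unbounded following) as col1,
         updatetimestamp,
         row_number() over (partition by id order by updatetimestamp desc) as rn
  from test.orc_acid_example_staging
  where type = 'U'
) u
where rn = 1;

Merging from staging_collapsed instead of the raw staging table keeps the merge deterministic when an id has several updates.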


You can reconstruct each record in the table using last_value() with the ignore-nulls option:

select h.id,
       coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.updatetimestamp)) as name,
       coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.updatetimestamp)) as col1,
       h.updatetimestamp
from history h;

You can use row_number() and a subquery if you want the most recent record.
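For example, combining the two (a sketch, reusing the column names from the query above):

select id, name, col1, updatetimestamp
from (select h.id,
             coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.updatetimestamp)) as name,
             coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.updatetimestamp)) as col1,
             h.updatetimestamp,
             row_number() over (partition by h.id order by h.updatetimestamp desc) as rn
      from history h
     ) h
where rn = 1;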
