I have to load an incremental feed into my base table (say table_stg) once every day. I get a snapshot of data every day from various sources in XML format. The id column is supposed to be unique, but since the data comes from different sources, there is a chance of duplicates.

day1: table_stg

id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l

day2: (say the XML is parsed and loaded into table_inter as below)

id,col2,col3,ts,col4
4,a,b,2016-06-25 01:12:27.000417,l
2,e,f,2016-06-25 01:12:27.000417,k
5,w,c,2016-06-25 01:12:27.000417,f
5,w,c,2016-06-25 01:12:27.000417,f

When I put this data into table_stg, my final output should be:

id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
4,a,b,2016-06-25 01:12:27.000417,l
5,w,c,2016-06-25 01:12:27.000417,f

What would be the best way to handle this kind of situation, without deleting table_stg (the base table) and reloading the whole data set?

user2810706
  • Similar to http://stackoverflow.com/questions/32073388/avoiding-data-duplication-when-loading-data-from-multiple-servers and also somewhat related to http://stackoverflow.com/questions/32004374/refresh-one-hive-table-from-another-hive-table/32007278#32007278 – Samson Scharfrichter Jun 27 '16 at 21:10
  • Plus some clean-up on the input since it may have some duplicates to begin with -- `SELECT id, a, b, c FROM (SELECT id, a, b, c, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ...) AS zz FROM src) xx WHERE zz=1 AND NOT EXISTS ...` – Samson Scharfrichter Jun 27 '16 at 21:14
  • See this question: http://stackoverflow.com/questions/37709411 You can apply the same technique (full join). And of course clean-up duplicates as @Samson Scharfrichter has already mentioned. – leftjoin Jun 28 '16 at 18:51
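
Putting the two comments together, a minimal PySpark sketch of that approach (dedup the incoming batch with a window function, then append only the ids that are not already in the base table) could look like the following. The `ORDER BY ts` choice and the left anti join (standing in for the elided ORDER BY column and the NOT EXISTS clause) are assumptions:

```python
# A minimal sketch, assuming both tables live in the Hive metastore.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = (SparkSession.builder
         .appName("table_stg-incremental-load")
         .enableHiveSupport()
         .getOrCreate())

base = spark.table("table_stg")
incoming = spark.table("table_inter")

# Step 1: clean up the batch itself -- keep one row per id, latest ts first.
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
deduped = (incoming
           .withColumn("zz", F.row_number().over(w))
           .where(F.col("zz") == 1)
           .drop("zz"))

# Step 2: keep only ids that are not already in the base table
# (the NOT EXISTS part of the comment, expressed as a left anti join).
new_rows = deduped.join(base.select("id"), on="id", how="left_anti")

# Step 3: append the new rows; the existing data is never rebuilt.
new_rows.select("id", "col2", "col3", "ts", "col4").write.insertInto("table_stg")
```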

1 Answer


Hive does allow duplicates on primary and unique keys, so you should have an upstream job that cleans the data before loading it into the Hive table. You can write a Python script for that if the data volume is small, or use Spark if it is huge.

Spark provides the dropDuplicates() method to achieve this.
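
For example, a minimal sketch of that cleaning step (the table names table_inter and table_stg come from the question; everything else is illustrative):

```python
# Dedup the incoming batch before it ever touches the base table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dedup-incremental-batch")
         .enableHiveSupport()
         .getOrCreate())

incoming = spark.table("table_inter")

# dropDuplicates(["id"]) keeps one arbitrary row per id; with no argument it
# drops only rows that are identical across all columns.
cleaned = incoming.dropDuplicates(["id"])

# Append the cleaned batch to the base table.
cleaned.write.insertInto("table_stg")
```

Note that dropDuplicates() only cleans the batch itself; filtering out ids that already exist in table_stg still needs a join against the base table, as in the sketch under the comments above.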

Gaurav Gupta