I have to load an incremental feed into my base table (say table_stg) once every day. I get a snapshot of data every day from various sources in XML format. The id column is supposed to be unique, but since the data comes from different sources, there is a chance of duplicates.

day1: table_stg

id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l

day2: (say the XML is parsed and loaded into table_inter as below)

id,col2,col3,ts,col4
4,a,b,2016-06-25 01:12:27.000417,l
2,e,f,2016-06-25 01:12:27.000417,k
5,w,c,2016-06-25 01:12:27.000417,f
5,w,c,2016-06-25 01:12:27.000417,f

When I put this data into table_stg, my final output should be:

id,col2,col3,ts,col4
1,a,b,2016-06-24 01:12:27.000532,c
2,e,f,2016-06-24 01:12:27.000532,k
3,a,c,2016-06-24 01:12:27.000532,l
4,a,b,2016-06-25 01:12:27.000417,l
5,w,c,2016-06-25 01:12:27.000417,f

What would be the best way to handle this kind of situation, without deleting table_stg (the base table) and reloading the whole data set?

user2810706
  • Similar to http://stackoverflow.com/questions/32073388/avoiding-data-duplication-when-loading-data-from-multiple-servers and also somewhat related to http://stackoverflow.com/questions/32004374/refresh-one-hive-table-from-another-hive-table/32007278#32007278 – Samson Scharfrichter Jun 27 '16 at 21:10
  • Plus some clean-up on the input since it may have some duplicates to begin with -- `SELECT id, a, b, c FROM (SELECT id, a, b, c, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ...) AS zz FROM src) xx WHERE zz=1 AND NOT EXISTS ...` – Samson Scharfrichter Jun 27 '16 at 21:14
  • See this question: http://stackoverflow.com/questions/37709411 You can apply the same technique (full join). And of course clean-up duplicates as @Samson Scharfrichter has already mentioned. – leftjoin Jun 28 '16 at 18:51
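
Putting the two comments together, a minimal PySpark sketch of that approach (dedup the incoming batch with a window function, then append only the ids that are not already in the base table) could look like the following. The `ORDER BY ts` choice and the left anti join (standing in for the elided ORDER BY column and the NOT EXISTS clause) are assumptions:

```python
# A minimal sketch, assuming both tables live in the Hive metastore.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = (SparkSession.builder
         .appName("table_stg-incremental-load")
         .enableHiveSupport()
         .getOrCreate())

base = spark.table("table_stg")
incoming = spark.table("table_inter")

# Step 1: clean up the batch itself -- keep one row per id, latest ts first.
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
deduped = (incoming
           .withColumn("zz", F.row_number().over(w))
           .where(F.col("zz") == 1)
           .drop("zz"))

# Step 2: keep only ids that are not already in the base table
# (the NOT EXISTS part of the comment, expressed as a left anti join).
new_rows = deduped.join(base.select("id"), on="id", how="left_anti")

# Step 3: append the new rows; the existing data is never rebuilt.
new_rows.select("id", "col2", "col3", "ts", "col4").write.insertInto("table_stg")
```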

1 Answer


Hive does allow duplicates on primary and unique keys, so you should have an upstream job that cleans the data before loading it into the Hive table. You can write a Python script for that if the data volume is small, or use Spark if it is huge.

Spark provides the dropDuplicates() method to achieve this.
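
For example, a minimal sketch of that cleaning step (the table names table_inter and table_stg come from the question; everything else is illustrative):

```python
# Dedup the incoming batch before it ever touches the base table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dedup-incremental-batch")
         .enableHiveSupport()
         .getOrCreate())

incoming = spark.table("table_inter")

# dropDuplicates(["id"]) keeps one arbitrary row per id; with no argument it
# drops only rows that are identical across all columns.
cleaned = incoming.dropDuplicates(["id"])

# Append the cleaned batch to the base table.
cleaned.write.insertInto("table_stg")
```

Note that dropDuplicates() only cleans the batch itself; filtering out ids that already exist in table_stg still needs a join against the base table, as in the sketch under the comments above.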

Gaurav Gupta