0

I have a table with around 100 million records, on a daily basis we get around 100K records with updates. Currently we are applying a ROW_NUMBER on the timestamp and picking up the latest record by doing UNION ALL.

With this approach we are facing serious performance issues.

Can you suggest any better approach from performance perspective?

INSERT OVERWRITE TABLE tgt_tbl
SELECT * FROM
(
SELECT row_number() over (partition by acct_num order by time_stamp)
FROM 
     (SELECT acct_num , time_stamp FROM tgt_tbl
       UNION ALL 
      SELECT acct_num , time_stamp FROM Incremental table
     )t1
 ) t2
WHERE rnum = 1
Emma
  • 27,428
  • 11
  • 44
  • 69
PK25
  • 11
  • 2
  • Do you have the account number column for this table getting incremented in Ascending order with new data on daily basis ? – Gomz Jul 18 '19 at 05:09
  • account number doest follow any order, but the time stamp is incremental.account_number has other attributes that are updated – PK25 Jul 18 '19 at 22:05
  • Yes, This approach will be depreciated. you can restrict updates limited to the partitions which you are trying to update and use full outer join instead. https://stackoverflow.com/questions/56939313/optimize-the-join-performance-with-hive-partition-table – vikrant rana Oct 01 '19 at 15:01

0 Answers0