I'm working on Hive and new to it.
I am trying to improve the performance of the following code:
INSERT INTO target_table
(col1,col2,col3...) --- I have 64 columns in total
SELECT
t1.col1,t1.col2,t1.col3...
FROM (
SELECT
ts.col1,ts.col2,ts.col3...
FROM source_table ts
LEFT JOIN label_table tb
ON ( ts.colx = tb.colx AND ts.coly = tb.coly )
WHERE
ts.colx <> '' AND ts.colx <> 'NULL'
AND ts.colx IS NOT NULL
AND tb.colx IS NULL
LIMIT 10000
) t1
The goal of this code is to insert the rows into target_table
, that are from the source_table
and whose colx
is not found in label_table
.
The source_table
contains about 1.4 billion rows and the label_table
have fewer rows but I'll insert more rows into label_table
in the future (Its row number will eventually exceed 1 billion).
My problem: this code at present will take about half an hour to process 10,000 rows (the LIMIT
is 10,000
). If I increase the LIMIT
to 1,000,000
, my code will get stuck when the reduce process reaches 66.6%. Since I have 1.4 billion rows to be processed, the total time would be too long for me. Is there any way that can help to enhance the performance of this code? Could anyone please give some advice?
Thank you in advance!