HiveQL runs too slowly (need to improve the performance)

Question

I'm working on Hive and new to it.

I am trying to improve the performance of the following code:

INSERT INTO target_table
(col1,col2,col3...)   --- I have 64 columns in total
SELECT
 t1.col1,t1.col2,t1.col3...
FROM (
      SELECT
      ts.col1,ts.col2,ts.col3...
      FROM source_table ts
      LEFT JOIN label_table tb 
       ON ( ts.colx = tb.colx AND ts.coly = tb.coly )
      WHERE
      ts.colx <> '' AND ts.colx <> 'NULL'
      AND ts.colx IS NOT NULL
      AND tb.colx IS NULL 
      LIMIT 10000
      ) t1

The goal of this code is to insert the rows into target_table, that are from the source_table and whose colx is not found in label_table.

The source_table contains about 1.4 billion rows and the label_table have fewer rows but I'll insert more rows into label_table in the future (Its row number will eventually exceed 1 billion).

My problem: this code at present will take about half an hour to process 10,000 rows (the LIMIT is 10,000). If I increase the LIMIT to 1,000,000, my code will get stuck when the reduce process reaches 66.6%. Since I have 1.4 billion rows to be processed, the total time would be too long for me. Is there any way that can help to enhance the performance of this code? Could anyone please give some advice?

Thank you in advance!

`AND ts.colx IS NOT NULL` is redundant condition, because if `ts.colx <> ''`, it cannot be NULL — leftjoin, Jun 25 '18 at 06:44
Thank you! Besides this point, is there any other thing I can make a improvement? — noobie2023, Jun 25 '18 at 07:22
1st Check that join keys in second table do not duplicate rows. 2nd - Try to tune mapper and reducer parallelism, see here: https://stackoverflow.com/a/48487306/2700344 — leftjoin, Jun 25 '18 at 07:27
WHERE may be applied after join. Try to move it into subquery before join: `FROM ( select * from source_table ts where ts.colx <> '' AND ts.colx <> 'NULL' ) ts` — leftjoin, Jun 25 '18 at 07:33

MCardus · Accepted Answer · 2018-06-25T14:11:57.770

You may want to save you subquery

 SELECT
  ts.col1,ts.col2,ts.col3...
  FROM source_table ts
  LEFT JOIN label_table tb 
   ON ( ts.colx = tb.colx AND ts.coly = tb.coly )
  WHERE
  ts.colx <> '' AND ts.colx <> 'NULL'
  AND ts.colx IS NOT NULL
  AND tb.colx IS NULL 
  LIMIT 10000

into a new intermediate table, specially if you need to use this intermediate data multiple times. Then you can aply further optimitzations on that intermediate table such as using ORC, partitioning o statistics gathering.

I would also recommend you to avoid joining tables in Hive, they are usual performance problems.

Finally check out this guide providing general optimitzation tips for Hive

HiveQL runs too slowly (need to improve the performance)

1 Answers1