How can I deal with data skew in SQL on Hive?

Question

I have two tables: `netpack_busstop` has 100,000,000 rows and `ic_card_trade` has 100,000 rows. My query looks like this:

    SELECT
        count(*)
    FROM
        ic_card_trade tmpic
    LEFT JOIN netpack_busstop tmpnp 
    ON tmpic.line_no = tmpnp.line_no
    AND tmpic.bus_no = tmpnp.bus_no

This job takes more than 40 minutes to run on Hadoop, which is far too long.

I want the Hive query to finish faster, but I don't know how to achieve that in SQL.

Have you created clustered or nonclustered indexes on `line_no` and `bus_no` columns? — Maxim Zhukov, Sep 26 '18 at 08:27
I have not created clustered or nonclustered indexes on line_no and bus_no columns. — lee, Sep 26 '18 at 08:32
Read these answers about solving skew join using UNION ALL: https://stackoverflow.com/a/51061613/2700344 and this https://stackoverflow.com/a/40103932/2700344 — leftjoin, Sep 26 '18 at 08:45

score 0 · Answer 1 · answered Sep 26 '18 at 08:38

Since you haven't created any indexes on the columns used in the join, I believe your execution plan contains table scans over both tables, which cause the poor performance.

I think the root cause of the poor performance is the missing indexes. Here is a good article on how to handle it: Indexes & Views in Hive.

score 0 · Answer 2 · answered Sep 26 '18 at 11:08

You can rephrase the query:

    select sum(ic.cnt * coalesce(nb.cnt, 1))
    from (select line_no, bus_no, count(*) as cnt
          from ic_card_trade ic
          group by line_no, bus_no
         ) ic left join
         (select line_no, bus_no, count(*) as cnt
          from netpack_busstop nb
          group by line_no, bus_no
         ) nb
         on ic.line_no = nb.line_no and
            ic.bus_no = nb.bus_no;

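That is, do the aggregation first and then calculate the number of resulting rows. After the aggregation each (`line_no`, `bus_no`) key appears at most once on either side of the join, so a heavily repeated key no longer fans out into a huge number of joined rows.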
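To check that the pre-aggregated form really returns the same count as the original `LEFT JOIN`, here is a minimal sketch that runs both queries on toy data. It uses Python's built-in `sqlite3` rather than Hive, purely so it can be run anywhere; the sample rows are made up for illustration, and only the table and column names come from the question.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ic_card_trade (line_no INTEGER, bus_no INTEGER);
    CREATE TABLE netpack_busstop (line_no INTEGER, bus_no INTEGER);
""")

# Made-up rows: (1, 10) is the repeated key (3 trades x 4 bus-stop rows),
# (2, 20) matches exactly once, (3, 30) has no match on the right side.
conn.executemany("INSERT INTO ic_card_trade VALUES (?, ?)",
                 [(1, 10)] * 3 + [(2, 20), (3, 30)])
conn.executemany("INSERT INTO netpack_busstop VALUES (?, ?)",
                 [(1, 10)] * 4 + [(2, 20), (9, 90)])

# The query from the question.
original = conn.execute("""
    SELECT count(*)
    FROM ic_card_trade tmpic
    LEFT JOIN netpack_busstop tmpnp
      ON tmpic.line_no = tmpnp.line_no
     AND tmpic.bus_no = tmpnp.bus_no
""").fetchone()[0]

# The pre-aggregated rewrite: one row per key on each side before joining.
# coalesce(nb.cnt, 1) keeps the single row a LEFT JOIN emits for an unmatched key.
rewritten = conn.execute("""
    SELECT sum(ic.cnt * coalesce(nb.cnt, 1))
    FROM (SELECT line_no, bus_no, count(*) AS cnt
          FROM ic_card_trade
          GROUP BY line_no, bus_no) ic
    LEFT JOIN
         (SELECT line_no, bus_no, count(*) AS cnt
          FROM netpack_busstop
          GROUP BY line_no, bus_no) nb
      ON ic.line_no = nb.line_no
     AND ic.bus_no = nb.bus_no
""").fetchone()[0]

print(original, rewritten)  # prints "14 14" (3*4 + 1*1 + 1 unmatched)
```

The agreement on toy data says nothing about runtime on a real cluster; it only shows that the two forms count the same rows, including the unmatched left-side key.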

