0

I have two table,table of netpack_busstop has 100,000,000,the other table of ic_card_trade has 100,000.My query SQL is like this:

    SELECT
        count(*)
    FROM
        ic_card_trade tmpic
    LEFT JOIN netpack_busstop tmpnp 
    ON tmpic.line_no = tmpnp.line_no
    AND tmpic.bus_no = tmpnp.bus_no

I run this job on hadoop use more then 40min,it is so long.

I want to deal with hive sql quick ,use less time.I don't know how to do this by sql.

lee
  • 159
  • 1
  • 2
  • 7
  • 1
    Have you created clustered or nonclustered indexes on `line_no` and `bus_no` columns? – Maxim Zhukov Sep 26 '18 at 08:27
  • I have not created clustered or nonclustered indexes on line_no and bus_no columns. – lee Sep 26 '18 at 08:32
  • I create outer table in hive,maping the table with hdfs – lee Sep 26 '18 at 08:36
  • Read these answers about solving skew join using UNION ALL: https://stackoverflow.com/a/51061613/2700344 and this https://stackoverflow.com/a/40103932/2700344 – leftjoin Sep 26 '18 at 08:45

2 Answers2

0

Since you've not created any indexes on columns you use for joining your data, I believe your execution plan contains table scan operations over both tables, which gain your poor performance.

I think that the root cause of your poor performance is missing indexes and here is a good article how to handle it - Indexes & Views in hive.

Maxim Zhukov
  • 10,060
  • 5
  • 44
  • 88
0

You can rephrase the query:

select sum(ic.cnt * coalesce(nb.cnt, 1))
from (select line_no, bus_no, count(*) as cnt 
      from ic_card_trade ic
      group by line_no, bus_no
     ) ic left join
     (select line_no, bus_no, count(*) as cnt
      from netpack_busstop nb
      group by line_no, bus_no
     ) nb
     on ic.line_no = nb.line_no and
        ic.bus_no = nb.bus_no;

That is, do the aggregation first and then calculate the number of resulting rows.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786