1

i have a question about hive mapjoin , i know when a small table join big table , using mapjoin is better, but when i got a sql like this

select a.col1,
       a.col2,
       a.col3, 
       /* there has many columns from table a, ignore..*/
       b.col4,
       b.col5,
       b.col6
  from a
 inner join b
    on (a.id = b.id)
 where b.date = '2018-02-10'
   and b.hour = '10';

Tips:
table b is big table , rows: 10000W+
table a is big table , rows: 10000W+
table b with predicate only return 1000 rows, i think this sql will using mapjoin , but execution plan is join in reduce side...

who can tell me why ??

leftjoin
  • 36,950
  • 8
  • 57
  • 116
DMW
  • 11
  • 2
  • The type of join (map or otherwise) has exactly nothing to do with how many rows are returned from your query. Can a given ID in your table A have multiple rows? – Andrew Feb 20 '18 at 15:29
  • a.id and b.id has no duplicate rows, a:b is 1:1 – DMW Feb 20 '18 at 16:30

2 Answers2

0

I am not an expert in hive, but some times, the tools used as a client of SQL (i.e. MySQL Workbench) has implicit a limit 1000 in the settings. Try to specify a limit youself and force it to a higher value that 1000.

For example, check this image:

enter image description here

This is MySQL Workbench. The limit is automatically added to your query unless you specify a limit yourself.

King Midas
  • 1,442
  • 4
  • 29
  • 50
0

Try to move where clause into the subquery:

select a.col1,
       a.col2,
       a.col3, 
       /* there has many columns from table a, ignore..*/
       b.col4,
       b.col5,
       b.col6
  from a
 inner join (select * from b where b.date = '2018-02-10' and b.hour = '10' )b 
    on a.id = b.id
 ;

Also, intermediate filtered (temporary) table instead of subquery will work 100%, but this is not so efficient.

Also check these Hive configuration parameters:

set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory

Join will be converted to map-join if small table does not exceed the size specified by hive.mapjoin.smalltable.filesize parameter.

leftjoin
  • 36,950
  • 8
  • 57
  • 116