0

order by has only one reducer, so slow.I'm trying to find a fast way.sort by sorts in each reducer,then how can we get global ordering? I got this by search engine:

select * from 
(select title,cast(price as FLOAT) p from tablename 
distribute by time 
sort by p desc 
limit 10 ) t 
order by t.p desc 
limit 10;   

Then try to validate it.
1.Get right answer in my hive table.There are 215666 records in the table named tablename.

SELECT title,cast(price as FLOAT) p
from tablename 
WHERE dt='2020-03-08'
and price IS NOT NULL
ORDER BY p DESC
LIMIT 10
;

2.Use the searched clause.

set hive.execution.engine=mr;
set mapred.reduce.tasks=5; 
SELECT title,cast(price as FLOAT) p
from tablename
WHERE dt='2020-03-08'
and price IS NOT NULL
DISTRIBUTE BY title
SORT BY p desc
LIMIT 10
;

The result is the same as the right answer! Here are my questions:
1.Why only return 10 lines? There are 5 reducer, each reducer returns 10, should be 5*10=50?
2.If should return 10 lines, why the result is global ordering? This 10 line is not from the same reducer ? The limit is random, it cannot get global order in 5 reducer.
3.If should return 10 lines, the outer part in the searched clause is redundant?

select * from 
(
) t 
order by t.p desc 
limit 10;   

1 Answers1

0

Consider using total order partitioner, see https://cwiki.apache.org/confluence/display/Hive/HBaseBulkLoad#HBaseBulkLoad-PrepareRangePartitioning for details (just ignore part with HBase)

serge_k
  • 1,772
  • 2
  • 15
  • 21