order by
has only one reducer, so slow.I'm trying to find a fast way.sort by
sorts in each reducer,then how can we get global ordering?
I got this by search engine:
select * from
(select title,cast(price as FLOAT) p from tablename
distribute by time
sort by p desc
limit 10 ) t
order by t.p desc
limit 10;
Then try to validate it.
1.Get right answer in my hive table.There are 215666 records in the table named tablename.
SELECT title,cast(price as FLOAT) p
from tablename
WHERE dt='2020-03-08'
and price IS NOT NULL
ORDER BY p DESC
LIMIT 10
;
2.Use the searched clause.
set hive.execution.engine=mr;
set mapred.reduce.tasks=5;
SELECT title,cast(price as FLOAT) p
from tablename
WHERE dt='2020-03-08'
and price IS NOT NULL
DISTRIBUTE BY title
SORT BY p desc
LIMIT 10
;
The result is the same as the right answer!
Here are my questions:
1.Why only return 10 lines? There are 5 reducer, each reducer returns 10, should be 5*10=50?
2.If should return 10 lines, why the result is global ordering? This 10 line is not from the same reducer ? The limit is random, it cannot get global order in 5 reducer.
3.If should return 10 lines, the outer part in the searched clause is redundant?
select * from
(
) t
order by t.p desc
limit 10;