How to sort (order by) big data with hive efficiently?

Question

I want to sort a big dataset efficiently (i.e. with a custom partitioner, as described here: How does the MapReduce sort algorithm work?), but I want to do it with Hive.

However, the Hive manual states that "order by" is performed by a single reducer. This surprises me, as Pig does implement something similar to the article - pig impl

Am I missing something, or is Hive simply not the right hammer for this job?

Pig is way better in that respect. — John Jiang, Nov 28 '22 at 16:57

score 4 · Accepted Answer · answered Jul 16 '11 at 07:32

I think that Hive is not the right tool for the job, at least for now. It is built to be used as an OLAP/reporting tool and is therefore not optimized to produce large result datasets, since most analytical queries produce relatively small result sets. As a result, it has good TOP N capability but no efficient total ordering.
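To illustrate why TOP N is cheap while a total order is not, here is a conceptual sketch in Python. It is not Hive's actual implementation; the function names `local_top_n` and `global_top_n` are hypothetical, and each inner list stands in for one input split. The point is that every mapper only needs to forward N candidates, so a single reducer has very little to merge.

```python
import heapq
from typing import Iterable, List


def local_top_n(rows: Iterable[int], n: int) -> List[int]:
    """Map side (hypothetical): keep only the n smallest keys of one split."""
    return heapq.nsmallest(n, rows)


def global_top_n(splits: List[List[int]], n: int) -> List[int]:
    """Reduce side (hypothetical): merge the per-split candidates in one place."""
    candidates = [key for split in splits for key in local_top_n(split, n)]
    return heapq.nsmallest(n, candidates)


# Three made-up input splits; only 3 keys per split reach the single "reducer".
splits = [[42, 7, 19, 3], [88, 1, 56, 23], [5, 64, 2, 30]]
top3 = global_top_n(splits, 3)
print(top3)  # prints [1, 2, 3]
```

With a full ORDER BY there is no such cut-off: every row has to reach the reducer, which is where the single-reducer design starts to hurt.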

In case you haven't come across it before, I suggest looking into Hadoop's terasort example, which is specifically aimed at sorting a large dataset as efficiently as possible using MR. http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
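The general idea behind a sampling-based total-order sort can be sketched in a few lines of Python. This is a single-process simulation under my own assumptions, not the terasort code: the names `choose_split_points`, `range_partition` and `total_order_sort` are hypothetical, and plain integers stand in for the sort keys. A sample of the keys determines the range boundaries, each "reducer" sorts only its own range, and the outputs concatenated in reducer order are globally sorted.

```python
import bisect
import random
from typing import List


def choose_split_points(sample: List[int], num_reducers: int) -> List[int]:
    """Pick num_reducers - 1 boundaries from a sorted sample of the keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def range_partition(keys: List[int], split_points: List[int]) -> List[List[int]]:
    """Send each key to the partition whose key range contains it."""
    partitions: List[List[int]] = [[] for _ in range(len(split_points) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(split_points, key)].append(key)
    return partitions


def total_order_sort(keys: List[int], num_reducers: int, sample_size: int) -> List[int]:
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    sample = rng.sample(keys, min(sample_size, len(keys)))
    split_points = choose_split_points(sample, num_reducers)
    partitions = range_partition(keys, split_points)
    # Each "reducer" sorts only its own key range, independently of the others.
    sorted_parts = [sorted(part) for part in partitions]
    # Concatenating the outputs in partition order yields a total order.
    return [key for part in sorted_parts for key in part]


data = random.Random(1).sample(range(1_000_000), 10_000)
result = total_order_sort(data, num_reducers=4, sample_size=100)
print(result == sorted(data))
```

The sample only affects how evenly the work is spread across reducers; the result is correctly ordered for any choice of boundaries.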

Hive can be used to generate large HDFS or local files based on the queries. But the issue here is ordering. Hive can do ORDER BY only when it is using a single reducer. That would indeed be quite inefficient. — Olaf, Jul 18 '11 at 15:06

score 1 · Answer 2 · answered May 29 '12 at 16:33

It is not possible to use multiple reducers for doing total ordering in Hive. It has not been implemented yet - https://issues.apache.org/jira/browse/HIVE-1402 .

If you want efficient total ordering, it will be easier to use Pig than to write a custom MR job.

score 0 · Answer 3 · answered Jul 12 '11 at 15:10

Hive generates MapReduce job(s) for executing the queries. In your particular case the actual sorting is done by the Hadoop MapReduce framework before the data is fed into the reducer.
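A small Python simulation shows what that means for the number of reducers. This is a sketch, not Hadoop code: the `shuffle` function is hypothetical, and `key % num_reducers` is an assumed stand-in for a hash-based partitioner. Each reducer receives its keys in sorted order, but only with a single reducer is the combined output totally ordered.

```python
from typing import Dict, List


def shuffle(keys: List[int], num_reducers: int) -> Dict[int, List[int]]:
    """Hash-partition the keys, then sort each partition before it is reduced."""
    partitions: Dict[int, List[int]] = {r: [] for r in range(num_reducers)}
    for key in keys:
        # Assumption: modulo stands in for a hash partitioner.
        partitions[key % num_reducers].append(key)
    return {r: sorted(part) for r, part in partitions.items()}


keys = [9, 4, 7, 1, 8, 2, 6, 3]

one = shuffle(keys, 1)    # a single reducer sees every key in order
three = shuffle(keys, 3)  # each reducer is sorted, but key ranges overlap

# Reading the reducer outputs one after another, as files in an output directory would be.
concatenated = [key for r in sorted(three) for key in three[r]]

print(one[0])        # prints [1, 2, 3, 4, 6, 7, 8, 9]
print(concatenated)  # prints [3, 6, 9, 1, 4, 7, 2, 8]
```

Getting a total order out of several reducers therefore needs a partitioner that assigns contiguous key ranges rather than hashes.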

Olaf

