4

I want to sort a big dataset efficiently (i.e. with a custom partitioner, as described here: How does the MapReduce sort algorithm work?), but I want to do it with Hive.

However, the Hive manual states that "order by" is performed by a single reducer. This surprises me, as Pig does implement something similar to the article - pig impl

Am I missing something, or is it that Hive simply isn't the right hammer for this job?

ihadanny

3 Answers

4

I think Hive is not the right tool for the job, at least for now. It is built to be used as an OLAP/reporting tool and is therefore not optimized to produce large result datasets, since most analytical queries produce relatively small result sets. As a result, it has good TOP N capability but not good total ordering.

In case you haven't come across it before, I suggest looking into Hadoop's terasort example, which is specifically aimed at sorting a large dataset as efficiently as possible using MR. http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
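To make the idea concrete, here is a rough single-process sketch in Python of the sampling and range-partitioning approach that this kind of total-order sort relies on. It is not the terasort code itself; the function names (`choose_split_points`, `partition_index`, `total_order_sort`) and the parameters are made up for illustration, and each "reducer" is just a list in memory.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the input."""
    ordered = sorted(sample)
    if not ordered:
        return []  # no sample: every key falls into partition 0
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_index(key, split_points):
    """Range partitioner: keys in a higher partition are never smaller
    than keys in a lower partition."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(records, num_partitions=4, sample_size=100, seed=0):
    # Sample the input to estimate the key distribution (hypothetical sizes).
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = choose_split_points(sample, num_partitions)

    # "Shuffle": route every key to the partition that owns its key range.
    partitions = [[] for _ in range(num_partitions)]
    for key in records:
        partitions[partition_index(key, split_points)].append(key)

    # Each "reducer" sorts only its own partition, independently of the others.
    sorted_partitions = [sorted(part) for part in partitions]

    # Concatenating the partitions in order gives a globally sorted result.
    return [key for part in sorted_partitions for key in part]
```

Because the partitions cover disjoint, ordered key ranges, no final merge step is needed: the reducer outputs only have to be read in partition order. The quality of the sample affects only how evenly the work is spread, not the correctness of the result.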

David Gruzman
  • 1
    Hive can be used to generate large HDFS or local files based on the queries. But the issue here is ordering. Hive can do ORDER BY only when it is using a single reducer. That would indeed be quite inefficient. – Olaf Jul 18 '11 at 15:06
1

It is not possible to use multiple reducers for doing total ordering in Hive. It has not been implemented yet - https://issues.apache.org/jira/browse/HIVE-1402 .

If you want efficient total ordering, it will be easier to use Pig than to write a custom MR job.

Thejas Nair
0

Hive generates MapReduce job(s) for executing the queries. In your particular case the actual sorting is done by the Hadoop MapReduce framework before the data is fed into the reducer.
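As a toy illustration of that point, the following Python sketch imitates the shuffle: map output is assigned to reducers by a hash of the key, and each reducer's input is sorted by key before it is consumed. The function `run_shuffle` and the sample data are hypothetical, and `hash(key) % num_reducers` is only a stand-in for a default hash partitioner, not Hive's or Hadoop's actual code.

```python
from collections import defaultdict


def run_shuffle(map_output, num_reducers):
    """Return the key-sorted input that each simulated reducer would see."""
    buckets = defaultdict(list)
    for key, value in map_output:
        # Stand-in for a hash partitioner (an assumption of this sketch).
        buckets[hash(key) % num_reducers].append((key, value))
    # The framework sorts each reducer's input by key before reducing.
    return [sorted(buckets[r], key=lambda kv: kv[0]) for r in range(num_reducers)]


pairs = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c")]
single = run_shuffle(pairs, 1)   # one reducer sees every key in sorted order
double = run_shuffle(pairs, 2)   # each reducer is sorted, the concatenation is not
```

With one reducer the sorted input is already the total order, which is why ORDER BY works but does not scale. With several hash-partitioned reducers, each output is sorted only within itself.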

Olaf