1

Is it possible to sort a huge text file lexicographically using a MapReduce job that has only map tasks and zero reduce tasks?

The records of the text file are separated by newline characters, and the file is around 1 terabyte in size.

It would be great if anyone could suggest a way to sort this huge file.

j0k
Arun Vasu

3 Answers

3

I used a TreeSet in the map method to hold the entire data of the input split and persisted it. Finally, I got the sorted file!
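
Below is a minimal sketch of that approach, assuming the newer org.apache.hadoop.mapreduce API; the class and field names are illustrative and not taken from the original code.

    import java.io.IOException;
    import java.util.TreeSet;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only sort: each mapper buffers the lines of its own input split in a
    // TreeSet (kept in lexicographic order) and writes them out in cleanup().
    public class SplitSortMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private final TreeSet<String> lines = new TreeSet<String>();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Buffer every record of the split. Note that a TreeSet drops
            // duplicate lines; a TreeMap<String, Integer> holding counts would
            // be needed if duplicates must be preserved.
            lines.add(value.toString());
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit the buffered lines once per split, already sorted.
            Text out = new Text();
            for (String line : lines) {
                out.set(line);
                context.write(out, NullWritable.get());
            }
        }
    }

With job.setNumReduceTasks(0) the job stays map-only, so each output file is sorted within itself; the per-split files still have to be merged (or read in order) to obtain one globally sorted result.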

Arun Vasu
  • If you have any metrics, please publish them; they will be helpful. Metrics such as: i) time taken for sorting, ii) cluster size, iii) node h/w configuration. – Kash May 14 '13 at 14:00
  • If this is possible, then your file is evidently small enough to sort in-memory (which is hard to reconcile with the statement that it's 1 TB; how much RAM does your computer have?). If so, then Faith's (non-Hadoop) answer is most appropriate (because you're essentially using Hadoop as a do-nothing wrapper around a program that would work better without Hadoop). If you're planning on sorting larger datasets in the future, this method will break down when the files get too big. – Jim Pivarski Oct 29 '13 at 15:10
  • It does not mean I sort the entire file by keeping it in memory. I used Hadoop APIs to split the file into n chunks and sort each chunk individually. In this case, the amount of data kept in memory is very small considering my hardware environment. – Arun Vasu Nov 21 '13 at 07:19
2

There is in fact a sort example that is bundled with Hadoop. You can look at how the example code works by examining the class org.apache.hadoop.examples.Sort. This itself works pretty well, but if you want more flexibility with your sort, you can check this out.
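
For reference, here is a rough sketch of the same idea as a total-order sort driver rather than the bundled example class itself; the class name, sampling parameters, and reducer count are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    // Globally sorted output in the spirit of org.apache.hadoop.examples.Sort:
    // identity map/reduce plus a TotalOrderPartitioner fed by an input sampler,
    // so reducer i only receives keys smaller than those of reducer i + 1.
    public class TotalSortDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "total order sort");
            job.setJarByClass(TotalSortDriver.class);

            // Identity mapper and reducer: the shuffle/sort phase does the work.
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(4); // illustrative value

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Sample the input to compute partition boundaries for the reducers.
            job.setPartitionerClass(TotalOrderPartitioner.class);
            Path partitionFile = new Path(args[1] + "_partitions");
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
            InputSampler.writePartitionFile(job,
                    new InputSampler.RandomSampler<Text, Text>(0.01, 10000));
            job.addCacheFile(partitionFile.toUri());

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

KeyValueTextInputFormat treats a line without a tab as key-only, so sorting on the key sorts the whole line.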

Amar
0

Sorting in Hadoop is done using a Partitioner; you can write a custom partitioner to sort according to your business logic. Please see this link on writing a custom partitioner: http://jugnu-life.blogspot.com/2012/05/custom-partitioner-in-hadoop.html
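
As an illustration of the idea (this is not the partitioner from the linked post; the class name and the assumption of roughly uniform first bytes are mine), a partitioner can route key ranges to reducers so that the part files concatenate into one sorted file:

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Range partitioner: keys go to reducers by their first byte, so reducer 0
    // receives the lexicographically smallest keys, reducer 1 the next range,
    // and so on. Concatenating part-r-00000, part-r-00001, ... then gives a
    // globally sorted file, provided the first bytes are spread fairly evenly.
    public class FirstBytePartitioner extends Partitioner<Text, NullWritable> {

        @Override
        public int getPartition(Text key, NullWritable value, int numPartitions) {
            if (key.getLength() == 0) {
                return 0; // empty lines sort first
            }
            int firstByte = key.getBytes()[0] & 0xFF; // 0..255
            return firstByte * numPartitions / 256;   // 0..numPartitions-1
        }
    }

Wire it in with job.setPartitionerClass(FirstBytePartitioner.class); skewed data would need hand-picked or sampled split points instead of this fixed scheme.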

I do not advocate sorting terabytes of data using plain vanilla Linux sort commands; you would need to split the data to fit into memory to sort files this large: Parallel sort in Linux

It's better and more expedient to use Hadoop MergeSort instead: Hadoop MergeSort

You can look at some Hadoop sorting benchmarks and analysis from the Yahoo Hadoop team (now Hortonworks) here: Hadoop Sort benchmarks

Andy Jackson
fjxx
  • Thanks for your valuable inputs. I have tried most of them, and all required a reduce phase. I was searching for a map-only sort and achieved it using in-memory sorting (eliminating the write via the context). I was able to sort a 1 TB text file, which is a database dump, and generate the corresponding HFiles in under 1.40 hours. – Arun Vasu Feb 20 '13 at 08:26