2

Given a huge data set of integers, what would be the advantages of using map and reduce techniques over traditional sorting algorithms such as quicksort and mergesort?

Terry Li
  • 16,870
  • 30
  • 89
  • 134
  • maybe this will help http://stackoverflow.com/questions/1152732/how-does-the-mapreduce-sort-algorithm-work – yo_man Oct 06 '11 at 18:31

2 Answers2

1

Don't get me wrong, but MapReduce actually use sorting algorithms like quicksort and mergesort to sort the input for the reduce step. MapReduce is not a new sort algorithm, it is just a way to process data. And along the steps it gets sorted, that is just a nice side-effect.

Thomas Jungblut
  • 20,854
  • 6
  • 68
  • 91
  • Thanks for your reply. I was wondering how the processing speeds up sorting our data since we could apply quicksort or mergesort directly on our data. – Terry Li Oct 06 '11 at 18:08
  • Well it won't until you reach the point where the dataset is too large for just a single machine. Or you want to make additional processing along the way, say filter elements in the map step or grouping in the reduce step. – Thomas Jungblut Oct 06 '11 at 18:08
  • Wouldn't mergesort be enough for that? We could possibly divide the dataset, sort them on distributed machines and merge the collected results? – Terry Li Oct 06 '11 at 18:12
  • Of course, MapReduce is no sorting algorithm like I already told. It is just a nice side effect. Actually, MapReduce is doing the same thing you metioned, it takes equal sized splits and distributes them, they become sorted and before the reduce step starts the whole set gets merged. (and partitioned if you have multiple reduces) – Thomas Jungblut Oct 06 '11 at 18:14
  • But you have the additional overhead in mapping and paritioning though. – Thomas Jungblut Oct 06 '11 at 18:19
1

Map/reduce is more or less just a (scalable, common) way of describing a parallel computation. So you'd express a traditional sorting algorithm, like mergesort or quicksort, as a map/reduce if you wanted to do it as a parallel computation.

It's not a question of "is map/reduce better than mergesort or quicksort," because map/reduce is just a tool for implementing a sorting algorithm like mergesort or quicksort in a parallel way.

Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413
  • That said, doing a sort "in parallel" does have overheads that will be heavily dependent on how you're parallelizing and the precise implementation. Certainly for small input sizes, just doing it sequentially will be best, but the definition of "small" varies depending on your approach. (Multiple processors on a single machine? A distributed computation spread out across the world?) – Louis Wasserman Oct 06 '11 at 18:49