0

From here:

As per hadoop definitive guide "Within each partition, the back-ground thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort"

I thought a partition corresponds to one key, and thus a reduce task would reduce a bunch of values association with only one key. If there is only one key, isn't the partition already sorted?

After all, this answer from here, to me seems to contradict the previous quote:

Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start. It simply starts a new reduce task, when the next key in the sorted input data is different than the previous, to put it simply.

It is saying that a reduce task is associated with one key, and since thre is one partition per reduce task, a partition is associated with one key. So how come there must be a sort within each partition if there is only one key?

Mario Ishac
  • 5,060
  • 3
  • 21
  • 52
  • Some datasets are not partitioned. Even within a partition, there may be non-partitioned keys. A typical partition might be a `date`. That `date` might be **part** of a key with another `id`. – Elliott Frisch Sep 30 '18 at 21:29
  • @ElliottFrisch Can you expand more on "Even within a partition, there may be non-partitioned keys."? I'm not sure what you're saying (and that's on me). – Mario Ishac Sep 30 '18 at 22:26
  • The date is when something happened. The id is a customer id. A client id. A transaction id. A payment id. A payment method. Etc. Etc. Partitioning by the date does not mean date is the only key. – Elliott Frisch Sep 30 '18 at 22:29
  • Okay, I see what you're saying now. However, isn't the reducer supposed to be commutative (let's say pre-sorted data was `b`, `a`, but now sorted data is `a`, `b`). If the iterator that the hadoop reducer was fed yielded `a` first then `b`, as opposed to opposite order, are there any situations where it is reasonable that the reducer gives a different result? – Mario Ishac Sep 30 '18 at 22:39
  • That's a requirement of Java's sort as well. When you partition a table in Hive, the partition is just a directory the HDFS files are organized by. – Elliott Frisch Sep 30 '18 at 23:35
  • @ElliottFrisch What does that refer to? If you are talking about the reducer being commutative, doesn't that negate the need for a sort? – Mario Ishac Sep 30 '18 at 23:44

0 Answers0