
I saw the following post a while back: Understanding TreeReduce in Spark

I am still trying to understand exactly when to use a treeReduce vs a reduceByKey. I think we can use a universal example like a word count to help me understand further what is going on.

  • Does it always make sense to use reduceByKey in a word count?
  • Or is there a particular size of data when treeReduce makes more sense?
  • Are there particular cases or rules of thumb for when treeReduce is the better option?
  • Also (this may be answered by the above, given reduceByKey), does anything change between reduceByKeyLocally and treeReduce?
  • How do I appropriately determine depth?

Edit: So playing in spark-shell, I think I fundamentally don't understand the concept of treeReduce, but hopefully an example and those questions help.

res2: Array[(String, Int)] = Array((D,1), (18964,1), (D,1), (1,1), ("",1), ("",1), ("",1), ("",1), ("",1), (1,1))

scala> val reduce = input.reduceByKey(_+_)
reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:25

scala> val tree = input.treeReduce(_+_, 2)
<console>:25: error: type mismatch;
 found   : (String, Int)
 required: String
       val tree = input.treeReduce(_+_, 2)
asked by theMadKing

1 Answer

There is a fundamental difference between the two: reduceByKey is only available on key-value pair RDDs, while treeReduce is a generalization of the reduce operation on any RDD. reduceByKey is used in the implementation of treeReduce, but they are not related in any other sense.

reduceByKey performs a reduction per key, resulting in an RDD; it is not an "action" in the RDD sense but a transformation that returns a ShuffledRDD. This is equivalent to groupByKey followed by a map that does a key-wise reduction (see this for why using groupByKey is inefficient).
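For concreteness, here is a sketch of the two equivalent word-count pipelines described above (assuming an existing SparkContext `sc` and a hypothetical input path):

```scala
// Hypothetical setup: `sc` is an existing SparkContext, "input.txt" is an assumed path.
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))   // split lines into words
  .map(word => (word, 1))     // pair each word with a count of 1
  .reduceByKey(_ + _)         // transformation: combines counts per key, returns an RDD

// The roughly equivalent, but less efficient, groupByKey version:
val countsViaGroup = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .groupByKey()               // ships every (word, 1) pair across the network
  .mapValues(_.sum)           // key-wise reduction happens only after the shuffle
```

reduceByKey combines values map-side before the shuffle, so far less data crosses the network than with groupByKey.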

On the other hand, treeReduce is a generalization of the reduce function, inspired by AllReduce. It is an "action" in the Spark sense, returning the result to the master node. As explained in the link posted in your question, after performing the local reduce operation on each partition, reduce performs the rest of the computation on the master, which can be very burdensome (especially in machine learning, when the reduce function produces large vectors or matrices). Instead, treeReduce performs the reduction in parallel using reduceByKey (this is done by creating a key-value pair RDD on the fly, with the keys determined by the depth of the tree; check the implementation here).
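A sketch of the difference, on a numeric RDD (assumed example data; `sc` is an existing SparkContext):

```scala
// Hypothetical example: sum a numeric RDD split across many partitions.
val nums = sc.parallelize(1 to 1000000, numSlices = 100)

// Plain reduce: each partition is reduced locally, then all 100 partial
// results are sent to the master and combined there.
val total = nums.reduce(_ + _)

// treeReduce with depth 2: partial results are first combined in a
// distributed intermediate round (via reduceByKey under the hood),
// so the master receives far fewer values.
val treeTotal = nums.treeReduce(_ + _, depth = 2)
```

The two return the same value; treeReduce only changes where the combining work happens.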

So, to answer your first two questions: you have to use reduceByKey for a word count, since you are interested in a per-word count, and treeReduce is not appropriate here. The remaining questions are not related to this topic.
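As for the type mismatch in your edit: treeReduce's function must combine two elements of the RDD's element type, here (String, Int), which is why `_+_` does not type-check. A sketch of a call that does compile (the sample data is reconstructed from your shell output for illustration):

```scala
// Reconstructed sample of your pair RDD:
val input = sc.parallelize(Seq(("D", 1), ("18964", 1), ("D", 1), ("1", 1)))

// `_ + _` fails because treeReduce here needs a function
// ((String, Int), (String, Int)) => (String, Int).
// To get a single grand total of all counts (not per-key counts!),
// reduce over just the values:
val grandTotal = input.map(_._2).treeReduce(_ + _, 2)
```

Note that this collapses everything to one number, which is exactly why treeReduce cannot replace reduceByKey for a word count.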