I saw the following post a while back: Understanding TreeReduce in Spark
I am still trying to understand exactly when to use treeReduce vs. reduceByKey. I think a universal example like a word count (sketched below, after the questions) can help me understand what is going on.
- Does it always make sense to use reduceByKey in a word count?
- Or is there a particular data size at which treeReduce makes more sense?
- Are there particular cases or rules of thumb for when treeReduce is the better option?
- This may already be answered by the reduceByKey questions above, but does anything change between reduceByKeyLocally and treeReduce?
- How do I appropriately determine the depth?
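For concreteness, this is the kind of word count I have in mind (a minimal sketch; "input.txt" is a placeholder path and sc is the usual spark-shell SparkContext):

// split a text file into words, pair each word with 1, and
// combine the counts per key; reduceByKey returns an RDD
val input = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
val counts = input.reduceByKey(_ + _)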
Edit: After playing around in spark-shell, I think I fundamentally don't understand the concept of treeReduce, but hopefully an example and the questions above help.
A sample of input looks like this:
res2: Array[(String, Int)] = Array((D,1), (18964,1), (D,1), (1,1), ("",1), ("",1), ("",1), ("",1), ("",1), (1,1))
scala> val reduce = input.reduceByKey(_+_)
reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:25
scala> val tree = input.treeReduce(_+_, 2)
<console>:25: error: type mismatch;
found : (String, Int)
required: String
val tree = input.treeReduce(_+_, 2)
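If I am reading the signature right, treeReduce's function has to combine two elements of the RDD's own element type, here (String, Int), into a single one, which is why _+_ does not type-check. Two sketches that do compile, assuming I either want just the total count or a pairwise merge of the tuples:

scala> // sum only the counts: map to RDD[Int], then tree-reduce to one Int
scala> val total = input.map(_._2).treeReduce(_ + _, 2)

scala> // or combine whole pairs: concatenate keys, sum counts
scala> val merged = input.treeReduce((a, b) => (a._1 + b._1, a._2 + b._2), 2)

If I understand correctly, unlike reduceByKey, which is a transformation returning an RDD of per-key results, treeReduce is an action that returns a single value to the driver, so it is not a per-key word count at all.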