
So, I tried to test which Spark operations cause shuffling, based on this Stack Overflow post: LINK. However, it doesn't make sense to me that the cartesian operation doesn't cause shuffling in Spark, since the partitions still need to be moved across the network in order to be put together locally.

How does Spark actually perform its cartesian and distinct operations behind the scenes?

Tim

1 Answer


Shuffle is an operation specific to RDDs of key-value pairs (RDD[(T, U)], commonly described as PairRDDs or PairwiseRDDs) and is more or less equivalent to the shuffle phase in Hadoop. The goal of a shuffle is to move data to a specific executor based on the key value and the Partitioner.
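For example, a keyed aggregation such as `reduceByKey` typically triggers a shuffle: the Partitioner decides which executor each key's records end up on. A minimal sketch (assuming an existing SparkContext named `sc`):

```scala
import org.apache.spark.HashPartitioner

// A pair RDD: each record has a key (a word) and a value (a count).
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey shuffles: records sharing a key are routed to the same
// executor, as decided by the Partitioner (here a HashPartitioner).
val counts = pairs.reduceByKey(new HashPartitioner(4), _ + _)
```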

There are other types of operations in Spark that require network traffic but don't use the same logic as a shuffle and don't necessarily require key-value pairs. The Cartesian product is one of these operations. It moves data between machines (in fact, it causes much more expensive data movement) but doesn't establish a relationship between keys and executors.
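To make the contrast concrete, here is a small sketch (again assuming the same `sc`): `cartesian` copies partition data across the network so that every element of one RDD can be paired with every element of the other, with no keys or Partitioner involved.

```scala
val xs = sc.parallelize(1 to 3)
val ys = sc.parallelize(Seq("a", "b"))

// Every partition of xs is combined with every partition of ys, which
// means partition data is shipped between machines even though nothing
// is keyed or hashed to a particular executor.
val crossed = xs.cartesian(ys)   // RDD[(Int, String)] with 3 * 2 = 6 elements
crossed.collect().foreach(println)
```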

zero323
  • So, how do I know which operations will require network traffic in Spark? As you mentioned, `cartesian` is actually an expensive operation because of its data movement. Hence, it would be better for us to know which other operations behave this way so we can avoid them. – Tim Aug 02 '16 at 00:54
  • If an operation can be expressed by `mapPartitions` alone (`map`, `filter`, etc.), it doesn't require data movement. Otherwise it probably moves data in other ways. – zero323 Aug 02 '16 at 00:57
  • More info in the official Spark docs: http://spark.apache.org/docs/latest/programming-guide.html#shuffle-operations – DanielVL Aug 02 '16 at 10:55
  • @DanielVL They only talk about the shuffle operations. The document doesn't mention any other styles of network traffic like the one described above. I had a hard time finding information about this. – Tim Aug 02 '16 at 21:17
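As a small illustration of the rule of thumb from the comments above (a sketch, assuming the same SparkContext `sc`): transformations that can be expressed with `mapPartitions` work on each partition independently, so no data has to leave the machine.

```scala
val nums = sc.parallelize(1 to 10, numSlices = 4)

// map and filter are applied per partition; every element stays in the
// partition it started in, so no network traffic is needed.
val transformed = nums.map(_ * 2).filter(_ % 3 == 0)

// The same computation expressed explicitly with mapPartitions.
val viaMapPartitions = nums.mapPartitions(iter => iter.map(_ * 2).filter(_ % 3 == 0))
```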