I know that some Spark actions, like collect(), can cause performance issues.
It has been quoted in the documentation:

"To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println)."
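To illustrate the difference as I understand it, here is a minimal sketch (it assumes an existing SparkContext named sc and uses toy data; the names are just for the example):

    // collect() materialises the entire RDD on the driver JVM,
    // while take(n) only fetches the first n elements.
    val rdd = sc.parallelize(1 to 1000000)

    // Risky on large data: every element is pulled back to the driver.
    rdd.collect().foreach(println)

    // Safer: only the first 100 elements are brought to the driver.
    rdd.take(100).foreach(println)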
And from one more related SE question, Spark runs out of memory when grouping by key, I have come to know that groupByKey() and reduceByKey() may cause out-of-memory errors if the parallelism is not set properly.
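For context, this is roughly what I understand the parallelism advice to mean (a hedged sketch assuming an existing SparkContext named sc; the data and the partition count of 200 are made up for illustration):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // groupByKey shuffles every value for a key across the network before
    // anything is combined; the explicit partition count spreads the shuffle
    // output over more, smaller partitions.
    val grouped = pairs.groupByKey(200)

    // reduceByKey combines values on the map side first, so much less data
    // is shuffled for the same result.
    val counts = pairs.reduceByKey(_ + _, 200)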
I did not find enough evidence on other transformations and actions that have to be used with caution. Are these three the only commands to watch out for? I have doubts about the commands below too (a rough sketch of how I am calling them follows the list):
aggregateByKey()
sortByKey()
persist() / cache()
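For reference, this is roughly how I use them (a hedged sketch with made-up data, again assuming a SparkContext named sc):

    import org.apache.spark.storage.StorageLevel

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // aggregateByKey: like reduceByKey, it combines within each partition
    // first, but it still shuffles data by key.
    val sums = pairs.aggregateByKey(0)(_ + _, _ + _)

    // sortByKey: needs a full shuffle to range-partition the data by key.
    val sorted = pairs.sortByKey()

    // persist()/cache(): keeps the RDD in memory (spilling to disk with this
    // storage level), which takes memory away from other work if overused.
    val cached = pairs.persist(StorageLevel.MEMORY_AND_DISK)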
It would be great if you could provide information on the intensive commands (those that work globally across partitions rather than within a single partition, or otherwise low-performance commands) that have to be handled with better guarding.