It is often cleaner to express a complicated map operation as a series of chained map calls in code rather than as one large operation. I know the Spark DAG scheduler performs optimizations, but will it also optimize chained operations like this?
Here's a contrived example where a list of distinct dates is pulled out of a CSV field:
csv.map(row => row.split(","))
.map(row => row(6)) // extract the proper field
.map(date_field => DateTime.parse(date_field).withTimeAtStartOfDay())
.distinct()
Would this example be more efficient written as a single map operation followed by a distinct()?
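For reference, the single-map alternative I have in mind would look roughly like this (a sketch only; it assumes the same csv RDD and Joda-Time DateTime as above):

```scala
// Hypothetical fused version: all three per-row steps in one closure.
csv.map { line =>
  val fields = line.split(",")
  DateTime.parse(fields(6)).withTimeAtStartOfDay()
}
.distinct()
```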