We can conclude this from the plan generated by Spark.
This is the plan for DataFrame syntax-
val employees = spark.createDataFrame(Seq(("E1",100.0), ("E2",200.0),("E3",300.0))).toDF("employee","salary")
employees
.groupBy($"employee")
.agg(
avg($"salary").as("avg_salary")
).explain(true)
Plan -
== Parsed Logical Plan ==
'Aggregate ['employee], [unresolvedalias('employee, None), avg('salary) AS avg_salary#11]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, avg_salary: double
Aggregate [employee#4], [employee#4, avg(salary#5) AS avg_salary#11]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Optimized Logical Plan ==
Aggregate [employee#4], [employee#4, avg(salary#5) AS avg_salary#11]
+- LocalRelation [employee#4, salary#5]
== Physical Plan ==
*(2) HashAggregate(keys=[employee#4], functions=[avg(salary#5)], output=[employee#4, avg_salary#11])
+- Exchange hashpartitioning(employee#4, 10)
+- *(1) HashAggregate(keys=[employee#4], functions=[partial_avg(salary#5)], output=[employee#4, sum#17, count#18L])
+- LocalTableScan [employee#4, salary#5]
As the plan suggests first "HashAggregate" happened with partial average then "exchange hashpartitioning" happened for full average. The conclusion is that catalyst optimized the DataFrame operation as if we programmed with "reduceByKey" syntax. So we needn't take the burden of writing low level code.
Here is how RDD code and plan looks like.
employees
.map(employee => ("key",(employee.getAs[Double]("salary"), 1))) // map entry with a count of 1
.rdd.reduceByKey {
case ((sumL, countL), (sumR, countR)) =>
(sumL + sumR, countL + countR)
}
.mapValues {
case (sum , count) => sum / count
}.toDF().explain(true)
Plan -
== Parsed Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#30, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#31]
+- ExternalRDD [obj#29]
== Analyzed Logical Plan ==
_1: string, _2: double
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#30, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#31]
+- ExternalRDD [obj#29]
== Optimized Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#30, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#31]
+- ExternalRDD [obj#29]
== Physical Plan ==
*(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#30, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#31]
+- Scan[obj#29]
The plan is optimized and also involves serialization of data into objects which means extra pressure of memory.
Conclusion
I would use daraframe syntax for its simplicity and possibly better performance.