For the purposes of comparison, suppose we have a table "T" with two columns "A","B". We also have a hiveContext operating in some HDFS database. We make a data frame:
In theory, which of the following is faster:
sqlContext.sql("SELECT A,SUM(B) FROM T GROUP BY A")
or
df.groupBy("A").sum("B")
where "df" is a dataframe referring to T. For these simple kinds of aggregate operations, is there any reason why one should prefer one method over the other?