I have a Spark Data Frame with a single column and large number of rows (in billions). I am trying to calculate the sum of the values in each row using the code shown below. However, it is very slow. Is there an efficient way to calculate the sum?
val df = sc.parallelize(Array(1,3,5,6,7,10,30)).toDF("colA")
df.show()
df.agg(sum("colA")).first().get(0) //very slow
Similar query was posted here: How to sum the values of one column of a dataframe in spark/scala The focus of this query is however about efficiency.