I need to use a DataFrame's row count as the divisor when calculating percentages.
This is what I'm doing:
scala> val df = Seq(1,1,1,2,2,3).toDF("value")
scala> val overallCount = df.count
scala> df.groupBy("value")
.agg( count(lit(1)) / overallCount )
But I would like to avoid calling df.count, because it is an action
and is evaluated immediately.
Accumulators won't help either, as their value would have to be evaluated in advance.
Is there a way to perform a lazy count over a dataframe?
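
One direction that might work, offered as a rough sketch rather than a confirmed answer: keep the total inside the query plan by computing it as a window aggregate over the grouped result, so no action runs until the final DataFrame is consumed. It relies on the sum of the per-group counts being equal to the overall row count. The names grouped, withPct, n and pct, and the local SparkSession setup, are assumptions made for illustration; in spark-shell the session lines can be dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, sum}

// Assumed local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lazy-count-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(1, 1, 1, 2, 2, 3).toDF("value")

// Per-group counts: a transformation, so nothing is executed yet.
val grouped = df.groupBy("value").agg(count(lit(1)).as("n"))

// sum(n) over an empty window spans every row of `grouped`, which gives
// the overall row count of `df` without a separate df.count action.
val withPct = grouped.withColumn("pct", $"n" / sum($"n").over())

// The whole plan is evaluated only when an action such as show() is called.
withPct.show()
```

An empty window moves all rows into a single partition, which is probably acceptable here because the grouped result has only one row per distinct value. An alternative under the same lazy constraint would be to cross-join with a one-row aggregate such as df.agg(count(lit(1))), at the cost of scanning df twice.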