
I need to use a DataFrame count as the divisor for calculating percentages.

This is what I'm doing:

scala> val df = Seq(1, 1, 1, 2, 2, 3).toDF("value")
scala> val overallCount = df.count   // action: runs a Spark job immediately
scala> df.groupBy("value")
         .agg(count(lit(1)) / overallCount)

But I would like to avoid the action `df.count`, as it is evaluated immediately.

Accumulators won't help either, as they would also have to be evaluated in advance.

Is there a way to perform a lazy count over a DataFrame?

pedromorfeu
1 Answer


Instead of using `Dataset.count` you can use a simple aggregation query, which stays lazy:

val overallCount = df.select(count($"*") as "overallCount")

and later `crossJoin` it with the grouped result:

df
  .groupBy("value")
  .agg(count(lit(1)) as "groupCount")
  .crossJoin(overallCount)  // joins against the single-row count DataFrame
  .select($"value", $"groupCount" / $"overallCount")
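
For reference, here is a self-contained sketch of the same approach (the `SparkSession` setup and the final `show` are only there for illustration; `show` is the single action, so nothing is evaluated before it):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 1, 1, 2, 2, 3).toDF("value")

// Both values below are just query plans; no Spark job has run yet
val overallCount = df.select(count($"*") as "overallCount")
val percentages = df
  .groupBy("value")
  .agg(count(lit(1)) as "groupCount")
  .crossJoin(overallCount)
  .select($"value", ($"groupCount" / $"overallCount") as "fraction")

percentages.show()  // the only action; evaluation happens here

For the sample data this gives 0.5 for value 1, ~0.33 for value 2 and ~0.17 for value 3.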
user10938362
  • Just what I was looking for. But it seems `crossJoin` is triggering an evaluation. – pedromorfeu Apr 11 '19 at 10:01
  • @PedroH `crossJoin` alone definitely doesn't trigger evaluation (tested using [the same methods I described here](https://stackoverflow.com/a/54270537/10938362)). That said, under certain conditions Spark might have to determine the number of partitions, though I cannot think of a specific case where that would be required here. If that does happen, you can always mark the corresponding `val`s as `lazy`. – user10938362 Apr 11 '19 at 10:05
  • Even with `lazy val`s, Spark evaluates them when they are used: `lazy val overallCountDF = df.select(count($"*") as "overallCount")` ... `.crossJoin(overallCountDF) // <-- evaluated` – pedromorfeu Apr 11 '19 at 10:12
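
A quick way to check claims like this (a rough sketch, not necessarily the exact method from the linked answer) is to register a `SparkListener` and watch how many jobs start; reusing `spark`, `df` and `overallCount` from the sketch above:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

var jobsStarted = 0
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = jobsStarted += 1
})

val joined = df
  .groupBy("value")
  .agg(count(lit(1)) as "groupCount")
  .crossJoin(overallCount)

// Listener events arrive asynchronously, so give the bus a moment before reading
Thread.sleep(1000)
println(s"jobs after building the plan: $jobsStarted")  // 0 if crossJoin is fully lazy

joined.show()
Thread.sleep(1000)
println(s"jobs after show(): $jobsStarted")             // > 0 once an action runs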