Create rows for 0 values when aggregating all combinations of several columns

Question

Using the example in this question, how do I create rows of 0 count when aggregating all possible combinations? When using cube, rows of 0 do not populate.

This is the code and output:

df.cube($"x", $"y").count.show

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

But this is the desired output (added row 4).

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   1|    0|   <- count of records where x = bar AND y = 1
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

Is there another function that could do that?

Oli · Answer 1 · 2022-09-24T14:01:03.900

First, let's see why you do not get combinations that do not appear in your dataset.

def cube(col1: String, cols: String*): RelationalGroupedDataset

Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

As the documentation states, cube is just a fancy group by. You can also check this by running explain on your result: you will see that cube is basically an expand (to produce the nulls) followed by a group by. It therefore cannot show combinations that do not appear in your dataset. A join is needed for that, so that values that never occur together in the same record can "meet".
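To make the "expand, then group by" description concrete, here is a rough plain-Python emulation of it. This is not Spark code and not how Spark implements cube internally; the function name cube_count is made up for illustration, and the records are the four rows from the question.

```python
from collections import Counter
from itertools import product

# Same four records as in the question: (x, y)
rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]


def cube_count(records):
    """Emulate cube(x, y).count: expand each record, then group and count."""
    counts = Counter()
    for x, y in records:
        # "Expand": emit one copy of the record per grouping set,
        # replacing the rolled-up columns with None (Spark's null).
        for keep_x, keep_y in product([True, False], repeat=2):
            key = (x if keep_x else None, y if keep_y else None)
            # "Group by": count the copies that share a key.
            counts[key] += 1
    return counts


cubed = cube_count(rows)

# Every key is derived from an existing record, so ("bar", 1) never appears.
print(("bar", 1) in cubed)   # False
print(cubed[(None, 2)])      # 3
```

Because each key is built from a record that actually exists, no amount of expanding can produce a pair such as ("bar", 1) whose values never occur together.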

So let's construct a solution:

import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark, as in spark-shell

// A dataset in which (2, 1) does not exist
val df = Seq((1, 1), (1, 2), (2, 2)).toDF("x", "y")

// This contains one row per possible combination, even those that are not in the dataset.
// Note that we set the count to 0.
val cartesian = df
    .select("x").distinct
    .crossJoin(df.select("y").distinct)
    .withColumn("count", lit(0))

// Let's now union the cube with the Cartesian product (CP) and
// group by again.
// Since the counts were set to zero in the CP, this does not change the
// counts of the cube. It simply adds the "missing" combinations with a count of 0.
df.cube("x", "y").count
    .union(cartesian)
    .groupBy("x", "y")
    .agg(sum("count").as("count"))
    .show

which yields:

+----+----+-----+
|   x|   y|count|
+----+----+-----+
|   2|   2|    1|
|   1|   2|    1|
|   1|   1|    1|
|   2|   1|    0|
|null|null|    3|
|   1|null|    2|
|null|   1|    1|
|null|   2|    2|
|   2|null|    1|
+----+----+-----+

Thank you! A cross join is exactly what I needed. Appreciate your help! — gillygangles, Sep 28 '22 at 15:08

score 1 · Accepted Answer · answered Sep 24 '22 at 14:25

I agree that crossJoin is the correct approach here. Afterwards, though, a join may be more versatile than a union followed by a groupBy, especially if there are more aggregations than a single count.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('foo', 1),
     ('foo', 2),
     ('bar', 2),
     ('bar', 2)],
    ['x', 'y'])

df_cartesian = df.select('x').distinct().crossJoin(df.select("y").distinct())
df_cubed = df.cube('x', 'y').count()
df_cubed.join(df_cartesian, ['x', 'y'], 'full').fillna(0, ['count']).show()

# +----+----+-----+
# |   x|   y|count|
# +----+----+-----+
# |null|null|    4|
# |null|   1|    1|
# |null|   2|    3|
# | bar|null|    2|
# | bar|   1|    0|
# | bar|   2|    2|
# | foo|null|    2|
# | foo|   1|    1|
# | foo|   2|    1|
# +----+----+-----+
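To illustrate why the join generalizes to several aggregations, here is a small plain-Python stand-in for the same steps. It is only a sketch, not Spark code: the amount column and the second aggregate (a sum of amount) are hypothetical additions to the question's data, and dictionaries stand in for the DataFrames.

```python
from itertools import product

# The question's records plus a hypothetical "amount" column.
rows = [("foo", 1, 10), ("foo", 2, 20), ("bar", 2, 30), ("bar", 2, 40)]

# Stand-in for a cube with two aggregates: key -> [count, total amount]
cubed = {}
for x, y, amount in rows:
    # One key per grouping set: (x, y), (x, None), (None, y), (None, None)
    for key in product((x, None), (y, None)):
        agg = cubed.setdefault(key, [0, 0])
        agg[0] += 1
        agg[1] += amount

# Stand-in for the cross join of the distinct x and y values.
cartesian = set(product({x for x, _, _ in rows}, {y for _, y, _ in rows}))

# Stand-in for the full join + fillna: keep every key from either side and
# give keys that are missing from the cube a default for each aggregate.
result = {key: tuple(cubed.get(key, (0, 0))) for key in cubed.keys() | cartesian}

print(result[("bar", 1)])    # (0, 0)
print(result[(None, None)])  # (4, 100)
```

With the join, each aggregate only needs a fill value for the missing combinations. The union and groupBy variant instead has to re-aggregate every column, which is straightforward for counts and sums but not for something like an average.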

I agree, utilizing a join provided a more streamlined process. Thank you! — gillygangles, Sep 28 '22 at 15:08
