There's one more member in the "family" which can explain it all: GROUPING SETS. We don't have it in the PySpark/Scala DataFrame API, but it exists in the SQL API.

GROUPING SETS lets you specify exactly the combinations of groupings you need. The others (cube, rollup, groupBy) return predefined sets of combinations:
cube("id", "x", "y") will return (), (id), (x), (y), (id, x), (id, y), (x, y), (id, x, y). (All the possible combinations.)

rollup("id", "x", "y") will only return (), (id), (id, x), (id, x, y). (The combinations that are prefixes of the provided column sequence.)

groupBy("id", "x", "y") will only return the (id, x, y) combination.
Examples
Input df:
df = spark.createDataFrame(
    [("a", "foo", 1),
     ("a", "foo", 2),
     ("a", "bar", 2),
     ("a", "bar", 2)],
    ["id", "x", "y"])
df.createOrReplaceTempView("df")
cube
df.cube("id", "x", "y").count()
is the same as...
spark.sql("""
    SELECT id, x, y, count(1) count
    FROM df
    GROUP BY
    GROUPING SETS (
        (),
        (id),
        (x),
        (y),
        (id, x),
        (id, y),
        (x, y),
        (id, x, y)
    )
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null| 2| 3|
|null|null|null| 4|
| a|null| 2| 3|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null| 1| 1|
|null| foo|null| 2|
| a|null|null| 4|
|null|null| 1| 1|
|null| foo| 2| 1|
|null| foo| 1| 1|
| a| foo| 2| 1|
|null| bar|null| 2|
|null| bar| 2| 2|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
rollup
df.rollup("id", "x", "y").count()
is the same as... GROUPING SETS ((), (id), (id, x), (id, x, y))
spark.sql("""
    SELECT id, x, y, count(1) count
    FROM df
    GROUP BY
    GROUPING SETS (
        (),
        (id),
        --(x),       <- (not used)
        --(y),       <- (not used)
        (id, x),
        --(id, y),   <- (not used)
        --(x, y),    <- (not used)
        (id, x, y)
    )
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null|null| 4|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null|null| 4|
| a| foo| 2| 1|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
groupBy
df.groupBy("id", "x", "y").count()
is the same as... GROUPING SETS ((id, x, y))
spark.sql("""
    SELECT id, x, y, count(1) count
    FROM df
    GROUP BY
    GROUPING SETS (
        --(),        <- (not used)
        --(id),      <- (not used)
        --(x),       <- (not used)
        --(y),       <- (not used)
        --(id, x),   <- (not used)
        --(id, y),   <- (not used)
        --(x, y),    <- (not used)
        (id, x, y)
    )
""")
+---+---+---+-----+
| id| x| y|count|
+---+---+---+-----+
| a|foo| 2| 1|
| a|foo| 1| 1|
| a|bar| 2| 2|
+---+---+---+-----+
Note. All of the above return only combinations that actually exist in the data. In the example dataframe there is no row for "id":"a", "x":"bar", "y":1, so even cube does not return it. In order to get all the possible combinations (existing or not), we need something like the following (a crossJoin):
df_cartesian = spark.range(1).toDF('_tmp')
for c in (cols := ["id", "x", "y"]):
    df_cartesian = df_cartesian.crossJoin(df.select(c).distinct())
df_final = (df_cartesian.drop("_tmp")
            .join(df.cube(*cols).count(), cols, 'full')
           )
df_final.show()
# +----+----+----+-----+
# | id| x| y|count|
# +----+----+----+-----+
# |null|null|null| 4|
# |null|null| 1| 1|
# |null|null| 2| 3|
# |null| bar|null| 2|
# |null| bar| 2| 2|
# |null| foo|null| 2|
# |null| foo| 1| 1|
# |null| foo| 2| 1|
# | a|null|null| 4|
# | a|null| 1| 1|
# | a|null| 2| 3|
# | a| bar|null| 2|
# | a| bar| 1| null|
# | a| bar| 2| 2|
# | a| foo|null| 2|
# | a| foo| 1| 1|
# | a| foo| 2| 1|
# +----+----+----+-----+
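Finally, to come back to the opening point: GROUPING SETS is not limited to the predefined shapes above; you can ask for exactly the groupings you want. A small sketch against the same df (the choice of (id) and (x, y) here is just an illustration):

spark.sql("""
    SELECT id, x, y, count(1) count
    FROM df
    GROUP BY
    GROUPING SETS (
        (id),     -- per-id totals
        (x, y)    -- per-(x, y) counts, nothing else
    )
""")

This computes only the per-id total and the per-(x, y) counts, with no grand total and no other subtotals, which is something neither cube, rollup, nor groupBy can express on its own.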