57

I can't find any detailed documentation regarding the differences.

I do notice a difference, because when I interchange cube and groupBy calls I get different results. With cube, I get a lot of null values in the grouping columns that are fully populated when I use groupBy.

Eric Staner

3 Answers

109

These are not intended to work in the same way. groupBy is simply an equivalent of the GROUP BY clause in standard SQL. In other words

table.groupBy($"foo", $"bar")

is equivalent to:

SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
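
For example, assuming `table` is a DataFrame registered as a temp view of the same name, and `foo`, `bar`, `baz` are placeholder columns (a minimal sketch of the equivalence):

import org.apache.spark.sql.functions.sum
import spark.implicits._  // for the $"..." column syntax

table.createOrReplaceTempView("table")

// DataFrame API
table.groupBy($"foo", $"bar").agg(sum($"baz"))

// equivalent SQL
spark.sql("SELECT foo, bar, SUM(baz) AS sum_baz FROM table GROUP BY foo, bar")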

cube is equivalent to the CUBE extension of GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Let's say you have data like this:

val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
df.show

// +---+---+
// |  x|  y|
// +---+---+
// |foo|  1|
// |foo|  2|
// |bar|  2|
// |bar|  2|
// +---+---+

and you compute cube(x, y) with count as an aggregation:

df.cube($"x", $"y").count.show

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+
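
Note that a null in this output can mean two different things: a subtotal placeholder, or an actual null in the data. The built-in grouping function tells them apart (a minimal sketch; grouping_id, mentioned in another answer below, is the bitmask variant):

import org.apache.spark.sql.functions.{count, grouping}

df.cube($"x", $"y")
  .agg(count("*").as("count"), grouping($"x").as("gx"), grouping($"y").as("gy"))
  .show

// gx (resp. gy) is 1 when x (resp. y) is aggregated away in that row,
// i.e. its null is a subtotal placeholder, and 0 otherwise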

A similar function to cube is rollup, which computes hierarchical subtotals from left to right:

df.rollup($"x", $"y").count.show
// +----+----+-----+
// |   x|   y|count|
// +----+----+-----+
// | foo|null|    2|   <- count where x is fixed to foo
// | bar|   2|    2|   <- count where x is fixed to bar and y is fixed to 2
// | foo|   1|    1|   ...
// | foo|   2|    1|   ...
// |null|null|    4|   <- count where no column is fixed
// | bar|null|    2|   <- count where x is fixed to bar
// +----+----+-----+

Just for comparison, let's see the result of plain groupBy:

df.groupBy($"x", $"y").count.show

// +---+---+-----+
// |  x|  y|count|
// +---+---+-----+
// |foo|  1|    1|   <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo|  2|    1|   <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar|  2|    2|   <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+

To summarize:

  • When using plain GROUP BY, every row is included only once, in its corresponding summary.
  • With GROUP BY CUBE(..), every row is included in the summary of each combination of levels it represents, wildcards included. Logically, the result shown above is equivalent to something like this (using NULL as a placeholder):

    SELECT NULL, NULL, COUNT(*) FROM table
    UNION ALL
    SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
    UNION ALL
    SELECT NULL, y,    COUNT(*) FROM table GROUP BY y
    UNION ALL
    SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y
    
  • GROUP BY ROLLUP(...) is similar to CUBE, but works hierarchically, filling columns from left to right; a runnable check of these rewrites follows this list.

    SELECT NULL, NULL, COUNT(*) FROM table
    UNION ALL
    SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
    UNION ALL
    SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y
    

ROLLUP and CUBE come from data warehousing extensions, so if you want to get a better understanding of how this works you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5, and they are relatively well documented.

zero323
  • wouldn't there be a `SELECT y, NULL, COUNT(*) FROM table GROUP BY y` in the `group by rollup` query? – Sam Oct 23 '18 at 17:23
  • @Sam: because `rollup` represents the given list of expressions and all **prefixes** of the list (including the empty list), while `cube` represents the given list and all of its possible **subsets**. – Kxrr Jul 06 '21 at 06:42
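
To make the prefixes-vs-subsets distinction in the comment above concrete, here is a plain Scala collections sketch (not Spark API):

val cols = List("id", "x", "y")

// rollup groups by every prefix of the list, including the empty one
cols.inits.toList
// List(List(id, x, y), List(id, x), List(id), List())

// cube groups by every subset of the list
(0 to cols.length).flatMap(cols.combinations).toList
// List(List(), List(id), List(x), List(y), List(id, x), List(id, y), List(x, y), List(id, x, y))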
3

There's one more member of the "family" which can explain it all: GROUPING SETS. We don't have it in PySpark/Scala, but it exists in the SQL API.

GROUPING SETS is used to design whatever combination of groupings is required. The others (cube, rollup, groupBy) return predefined sets of combinations:

cube("id", "x", "y") will return (), (id), (x), (y), (id, x), (id, y), (x, y), (id, x, y).
(All the possible existent combinations.)

rollup("id", "x", "y") will only return (), (id), (id, x), (id, x, y).
(Combinations which include the beginning of the provided sequence.)

groupBy("id", "x", "y") will only return (id, x, y) combination.


Examples

Input df:

df = spark.createDataFrame(
    [("a", "foo", 1),
     ("a", "foo", 2),
     ("a", "bar", 2),
     ("a", "bar", 2)],
    ["id", "x", "y"])
df.createOrReplaceTempView("df")
  • cube

    df.cube("id", "x", "y").count()
    

    is the same as...

    spark.sql("""
        SELECT id, x, y, count(1) count
        FROM df
        GROUP BY 
            GROUPING SETS (
                (),
                (id),
                (x),
                (y),
                (id, x),
                (id, y),
                (x, y),
                (id, x, y)
            )
    """)
    
    +----+----+----+-----+
    |  id|   x|   y|count|
    +----+----+----+-----+
    |null|null|   2|    3|
    |null|null|null|    4|
    |   a|null|   2|    3|
    |   a| foo|null|    2|
    |   a| foo|   1|    1|
    |   a|null|   1|    1|
    |null| foo|null|    2|
    |   a|null|null|    4|
    |null|null|   1|    1|
    |null| foo|   2|    1|
    |null| foo|   1|    1|
    |   a| foo|   2|    1|
    |null| bar|null|    2|
    |null| bar|   2|    2|
    |   a| bar|null|    2|
    |   a| bar|   2|    2|
    +----+----+----+-----+
    
  • rollup

    df.rollup("id", "x", "y").count()
    

    is the same as... GROUPING SETS ((), (id), (id, x), (id, x, y))

    spark.sql("""
        SELECT id, x, y, count(1) count
        FROM df
        GROUP BY 
            GROUPING SETS (
                (),
                (id),
                --(x),      <- (not used)
                --(y),      <- (not used)
                (id, x),
                --(id, y),  <- (not used)
                --(x, y),   <- (not used)
                (id, x, y)
            )
    """)
    
    +----+----+----+-----+
    |  id|   x|   y|count|
    +----+----+----+-----+
    |null|null|null|    4|
    |   a| foo|null|    2|
    |   a| foo|   1|    1|
    |   a|null|null|    4|
    |   a| foo|   2|    1|
    |   a| bar|null|    2|
    |   a| bar|   2|    2|
    +----+----+----+-----+
    
  • groupBy

    df.groupBy("id", "x", "y").count()
    

    is the same as... GROUPING SETS ((id, x, y))

    spark.sql("""
        SELECT id, x, y, count(1) count
        FROM df
        GROUP BY 
            GROUPING SETS (
                --(),       <- (not used)
                --(id),     <- (not used)
                --(x),      <- (not used)
                --(y),      <- (not used)
                --(id, x),  <- (not used)
                --(id, y),  <- (not used)
                --(x, y),   <- (not used)
                (id, x, y)
            )
    """)
    
    +---+---+---+-----+
    | id|  x|  y|count|
    +---+---+---+-----+
    |  a|foo|  2|    1|
    |  a|foo|  1|    1|
    |  a|bar|  2|    2|
    +---+---+---+-----+
    

Note: all of the above return only combinations that exist in the data. In the example dataframe there is no row with id="a", x="bar", y=1, so even cube does not return it. To get all the possible combinations (existent or not), we have to do something like the following (using crossJoin):

# build the cartesian product of the distinct values of every column
df_cartesian = spark.range(1).toDF('_tmp')
for c in (cols := ["id", "x", "y"]):
    df_cartesian = df_cartesian.crossJoin(df.select(c).distinct())

# full outer join: combinations absent from the data keep a null count
df_final = (df_cartesian.drop("_tmp")
    .join(df.cube(*cols).count(), cols, 'full')
)
df_final.show()
# +----+----+----+-----+
# |  id|   x|   y|count|
# +----+----+----+-----+
# |null|null|null|    4|
# |null|null|   1|    1|
# |null|null|   2|    3|
# |null| bar|null|    2|
# |null| bar|   2|    2|
# |null| foo|null|    2|
# |null| foo|   1|    1|
# |null| foo|   2|    1|
# |   a|null|null|    4|
# |   a|null|   1|    1|
# |   a|null|   2|    3|
# |   a| bar|null|    2|
# |   a| bar|   1| null|
# |   a| bar|   2|    2|
# |   a| foo|null|    2|
# |   a| foo|   1|    1|
# |   a| foo|   2|    1|
# +----+----+----+-----+
ZygD
-2
  1. If you do not want nulls, first remove them, for example:

     dfWithoutNull = df.na.drop("all", Seq("colName1", "colName2"))

     The expression above drops from the original dataframe the rows that are null in all of the listed columns.

  2. groupBy you know, I guess: the plain SQL GROUP BY.

  3. rollup and cube are GROUPING SETS operators. Rollup is a multidimensional aggregation that treats the elements hierarchically, while cube does the same thing across all dimensions. You can try grouping_id to understand the level of aggregation.
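
As a sketch of that last point, reusing the (x, y) DataFrame from the first answer, grouping_id encodes which grouping columns were aggregated away in each row:

import org.apache.spark.sql.functions.{count, grouping_id}

df.cube($"x", $"y")
  .agg(count("*").as("count"), grouping_id().as("level"))
  .show

// level = 0 for (x, y) groups, 1 for (x) subtotals,
// level = 2 for (y) subtotals, 3 for the grand total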