
I have a PySpark data frame that looks like this:

----------------------------
id    A    B    C 
id1   on   on   on
id1   on   off  on 
id1   on   on   on 
id1   on   on   on 
id1   on   on   off
-----------------------------

I am looking for a way to find all unique combinations of values in selected columns and show their counts. The expected output:

----------------------------
id    A    B    C    count
id1   on   on   on   3
id1   on   off  on   1
id1   on   on   off  1
-----------------------------

I see that there is a way to do a similar operation in Pandas, but I need PySpark.

UPD: Also, please note that a unique combination of columns A and B is not the same as a combination of A, B, C. I want all possible combinations of every column. Is there a way to achieve this other than grouping by and counting one combination after another (roughly the loop sketched below)? There are more than 10 columns.
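
For illustration, the combination-by-combination approach I would like to avoid looks roughly like this (just a sketch; `value_cols` stands in for my real column names):

from itertools import combinations

value_cols = ['A', 'B', 'C']  # in reality there are 10+ columns
counts = {}
for r in range(1, len(value_cols) + 1):
    for combo in combinations(value_cols, r):
        # one separate groupBy per subset of columns -- this is what I want to avoid
        counts[combo] = df.groupBy('id', *combo).count()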

  • You can `groupBy()` on all the columns and take the count; see the following question -- https://stackoverflow.com/q/71353304/8279585 – samkart Jun 28 '22 at 06:47
  • samkart, yes, I know. But the count of a groupby on id, A, B is not the same as the count of a groupby on A, B, C, or on A, C. Is there a way to count every possible combination of every column with one line of code? – Leo Jun 28 '22 at 06:51
  • Based on your "want" example above, a simple `groupBy()` followed by a count is all you're looking for. The equivalent SQL would be `select id, a, b, c, count(*) as cnt from table group by 1, 2, 3, 4` – samkart Jun 28 '22 at 06:54
  • I updated the description so it explains what I want to achieve. – Leo Jun 28 '22 at 07:19

2 Answers


`cube` can do it, but it returns ALL grouping-set combinations, including those where some columns are not taken into account (those columns come back as null), so you will have to filter afterwards.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('id1', 'on', 'on', 'on'),
     ('id1', 'on', 'off', 'on'), 
     ('id1', 'on', 'on', 'on'), 
     ('id1', 'on', 'on', 'on'), 
     ('id1', 'on', 'on', 'off')],
    ['id', 'A', 'B', 'C'])

df_cube = df.cube(df.columns).count()
# keep only the rows grouped on every column (cube leaves excluded columns as null)
df_cube = df_cube.filter(F.forall(F.array(df_cube.columns), lambda x: x.isNotNull()))

df_cube.show()
# +---+---+---+---+-----+
# | id|  A|  B|  C|count|
# +---+---+---+---+-----+
# |id1| on| on| on|    3|
# |id1| on|off| on|    1|
# |id1| on| on|off|    1|
# +---+---+---+---+-----+

This counts occurrences using just the specified columns:

cols = ['A', 'B']
df_cube = df.cube(cols).count()  # again starting from the original df
df_cube = df_cube.filter(F.forall(F.array(df_cube.columns), lambda x: x.isNotNull()))

df_cube.show()
# +---+---+-----+
# |  A|  B|count|
# +---+---+-----+
# | on|off|    1|
# | on| on|    4|
# +---+---+-----+
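
If the goal (per the update in the question) is to get counts for every possible subset of columns in a single pass, one option is to keep the unfiltered cube output: a null in a column marks the grouping sets that exclude that column, and `F.grouping_id()` can tag each row with the grouping set it belongs to. A rough sketch, assuming the original `df` from above and data that contains no real nulls:

all_sets = df.cube(df.columns).agg(
    F.grouping_id().alias('grouping_set'),  # bitmask of columns aggregated away in this row
    F.count('*').alias('count'))

all_sets.show()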

Solution:

df = spark.createDataFrame(
    [
        ("id1", "on","on","on"),  # create your data here, be consistent in the types.
        ("id1", "on","off","on"),
        ("id1", "on","on","on"),
        ("id1", "on","on","on"),
        ("id1", "on","on","off"),
    ],
    ["id", "A" , "B" , "C"]  # add your column names here
)

Apart from the `cube` function, there is also the `rollup` function.

cube: takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns.

rollup: similar to `cube`, but it computes hierarchical subtotals from left to right. `GROUP BY ROLLUP(...)` only produces the grouping sets obtained by dropping columns from the right, rather than every possible combination.

from pyspark.sql import functions as F

df_rollup = df.rollup(df.columns).count()

# drop the subtotal rows that rollup adds (they contain nulls in the grouping columns)
df1 = df_rollup.na.drop(subset=df_rollup.columns)

df1.show()

# +---+---+---+---+-----+
# | id|  A|  B|  C|count|
# +---+---+---+---+-----+
# |id1| on| on|off|    1|
# |id1| on| on| on|    3|
# |id1| on|off| on|    1|
# +---+---+---+---+-----+
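
To make the difference concrete, here is a small comparison, assuming `df` still refers to the original 5-row frame built above: `rollup('A', 'B')` produces the grouping sets (A, B), (A) and (), while `cube('A', 'B')` also produces (B).

df.rollup('A', 'B').count().show()   # grouping sets: (A, B), (A), ()
df.cube('A', 'B').count().show()     # grouping sets: (A, B), (A), (B), ()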