How to do groupby and find unique items of a column in PySpark

Question

I have a PySpark DataFrame. I want to group by one column and then find the unique items in another column for each group.

In pandas I could do:

data.groupby(by=['A'])['B'].unique()

I want to do the same with my Spark DataFrame. I can get the distinct count and the total count of items in each group like this:

from pyspark.sql import functions as fn
from pyspark.sql.functions import col

(spark_df.groupby('A')
    .agg(
        fn.countDistinct(col('B'))
            .alias('unique_count_B'),
        fn.count(col('B'))
            .alias('count_B')
        )
    .show())

But I couldn't find a function that returns the unique items themselves for each group.

To clarify, consider a sample DataFrame:

df = spark.createDataFrame(
  [(1, "a"), (1, "b"), (1, "a"), (2, "c")],
  ["A", "B"])

I expect output like this:

+---+----------+
|  A|  unique_B|
+---+----------+
|  1|  [a, b]  |
|  2|  [c]     |
+---+----------+

How can I get the same output in PySpark as in pandas?

score 9 · Accepted Answer · answered Jun 19 '19 at 08:53

I used collect_set for this:

from pyspark.sql import functions as fn
from pyspark.sql.functions import col

(df.groupby('A')
    .agg(
        fn.collect_set(col('B')).alias('unique_count_B')
    )
    .show())

This gives the output I need:

+---+--------------+
|  A|unique_count_B|
+---+--------------+
|  1|        [b, a]|
|  2|           [c]|
+---+--------------+

score 1 · Answer 2 · answered Jun 19 '19 at 08:28

You can use the following code, which uses window functions.

from pyspark.sql import functions as F
from pyspark.sql import Window

df = spark.createDataFrame(
  [(1, "a"), (1, "b"), (1, "a"), (2, "c")],
  ["A", "B"])
win = Window.partitionBy("A", "B")

df.withColumn("distinct AB", 
  F.count("*").over(win)).distinct().show()

The result is:

+---+---+-----------+
|  A|  B|distinct AB|
+---+---+-----------+
|  2|  c|          1|
|  1|  a|          2|
|  1|  b|          1|
+---+---+-----------+

Elior Malul

I have updated my question with a sample data frame and output I need. – Sreeram TP Jun 19 '19 at 08:47
In this case, try `pyspark.sql.functions.collect_set` instead of `F.count` – Elior Malul Jun 19 '19 at 08:51
yeah, I just figured it out – Sreeram TP Jun 19 '19 at 08:52

score -1 · Answer 3 · answered Jun 19 '19 at 08:03

You could use a list comprehension for this. After your grouping, assuming the DataFrame is in spark_df and the column of interest is named k, you can use:

[row.k for row in spark_df.select('k').distinct().collect()]

Akshay Shah

I am not looking for something with loops – Sreeram TP Jun 19 '19 at 08:47
