6

I have a pySpark dataframe, I want to group by a column and then find unique items in another column for each group.

In pandas I could do,

data.groupby(by=['A'])['B'].unique()

I want to do the same with my spark dataframe. I could find the distictCount of items in the group and count also, like this

(spark_df.groupby('A')
    .agg(
        fn.countDistinct(col('B'))
            .alias('unique_count_B'),
        fn.count(col('B'))
            .alias('count_B')
        )
    .show())

But I couldn't find some function to find unique items in the group.

For clarifying more consider a sample dataframe,

df = spark.createDataFrame(
  [(1, "a"), (1, "b"), (1, "a"), (2, "c")],
  ["A", "B"])

I am expecting to get an output like this,

+---+----------+
|  A|  unique_B|
+---+----------+
|  1|  [a, b]  |
|  2|  [c]     |
+---+----------+

How to do get the output as in pandas in pySpark.?

Sreeram TP
  • 11,346
  • 7
  • 54
  • 108

3 Answers3

9

I used collect_set for my purpose like this,

(df.groupby('A')
    .agg(
        fn.collect_set(col('B')).alias('unique_count_B')
    )
    .show())

I get the following output as I need,

+---+--------------+
|  A|unique_count_B|
+---+--------------+
|  1|        [b, a]|
|  2|           [c]|
+---+--------------+
Sreeram TP
  • 11,346
  • 7
  • 54
  • 108
1

you can use the following code, that uses Window functions.

from pyspark.sql import functions as F
from pyspark.sql import Window

df = spark.createDataFrame(
  [(1, "a"), (1, "b"), (1, "a"), (2, "c")],
  ["A", "B"])
win = Window.partitionBy("A", "B")

df.withColumn("distinct AB", 
  F.count("*").over(win)).distinct().show()

The result is:

+---+---+-----------+
|  A|  B|distinct AB|
+---+---+-----------+
|  2|  c|          1|
|  1|  a|          2|
|  1|  b|          1|
+---+---+-----------+
Elior Malul
  • 683
  • 6
  • 8
-1

You could use a list comprehension for this After your grouping let's say the dataframe is in spark_df, you can use:

[row.k for row in spark_df.select('k').distinct().collect()]
Akshay Shah
  • 704
  • 4
  • 11