I have a PySpark dataframe. I want to group by one column and then find the unique items in another column for each group.
In pandas I could do,
data.groupby(by=['A'])['B'].unique()
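For example, on a small frame (made up here only to illustrate, and mirroring the Spark sample further down) this returns the unique values per group:

import pandas as pd

data = pd.DataFrame({"A": [1, 1, 1, 2], "B": ["a", "b", "a", "c"]})
print(data.groupby(by=['A'])['B'].unique())
# Prints roughly:
# A
# 1    [a, b]
# 2       [c]
# Name: B, dtype: object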
I want to do the same with my Spark dataframe. I can get the distinct count of items in each group, along with the plain count, like this:
from pyspark.sql import functions as fn
from pyspark.sql.functions import col

(spark_df.groupby('A')
    .agg(
        fn.countDistinct(col('B')).alias('unique_count_B'),
        fn.count(col('B')).alias('count_B')
    )
    .show())
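For the sample dataframe I show further down, this only gives me the counts per group, something like:

+---+--------------+-------+
|  A|unique_count_B|count_B|
+---+--------------+-------+
|  1|             2|      3|
|  2|             1|      1|
+---+--------------+-------+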
But I couldn't work out how to get the unique items themselves for each group.
To clarify further, consider this sample dataframe:
df = spark.createDataFrame(
    [(1, "a"), (1, "b"), (1, "a"), (2, "c")],
    ["A", "B"])
I am expecting to get output like this:
+---+--------+
|  A|unique_B|
+---+--------+
|  1|  [a, b]|
|  2|     [c]|
+---+--------+
How can I get the same output as in pandas in PySpark?
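Is something like fn.collect_set what I am looking for? A rough sketch of what I have in mind (I am not sure whether this is the right approach, and I believe it does not guarantee any ordering of the items):

from pyspark.sql import functions as fn
from pyspark.sql.functions import col

(df.groupby('A')
    .agg(fn.collect_set(col('B')).alias('unique_B'))
    .show())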