The following is my dataframe:
df = spark.createDataFrame([
(0, 1),
(0, 2),
(0, 5),
(1, 1),
(1, 2),
(1, 3),
(1, 5),
(2, 1),
(2, 2)
], ["id", "product"])
I need to do a groupBy on id and collect all the products into a list, as shown below. Before collecting, though, I need to check each product's count across the whole dataframe: if it is less than 2, that product should not appear in the collected items.
For example, product 3 appears only once, i.e. its count is 1, which is less than 2, so it should not appear in the resulting dataframe. It looks like I need to do two groupBys.
Expected output:
+---+------------+
| id|       items|
+---+------------+
|  0|   [1, 2, 5]|
|  1|   [1, 2, 5]|
|  2|      [1, 2]|
+---+------------+
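
To pin down the intended semantics, the two-pass logic described above can be sketched in plain Python, independent of Spark. This is only an illustration of the requirement, not a Spark solution: the names `rows`, `MIN_COUNT`, `product_counts`, and `items_by_id` are hypothetical, and the threshold of 2 is taken from the description. In PySpark the first pass would presumably correspond to a groupBy on product with a count, and the second to a groupBy on id that collects the remaining products.

```python
from collections import Counter, defaultdict

# Same rows as the dataframe above, as plain (id, product) tuples.
rows = [
    (0, 1), (0, 2), (0, 5),
    (1, 1), (1, 2), (1, 3), (1, 5),
    (2, 1), (2, 2),
]

# Assumed threshold: products seen fewer than MIN_COUNT times are dropped.
MIN_COUNT = 2

# Pass 1: count how often each product occurs across all ids.
product_counts = Counter(product for _, product in rows)

# Pass 2: group by id, keeping only products that meet the threshold.
items_by_id = defaultdict(list)
for id_, product in rows:
    if product_counts[product] >= MIN_COUNT:
        items_by_id[id_].append(product)

result = dict(items_by_id)
print(result)  # {0: [1, 2, 5], 1: [1, 2, 5], 2: [1, 2]}
```

Product 3 has a count of 1 and is therefore dropped from id 1, which matches the expected output.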