I have a Spark DataFrame df that looks like this:
+----+------+------+
|user| value|number|
+----+------+------+
|   A|    25|    13|
|   A|     6|    14|
|   A|     2|    11|
|   A|    32|    17|
|   B|    22|    19|
|   B|    42|    10|
|   B|    43|    32|
|   C|    33|    12|
|   C|    90|    21|
|   C|    12|    32|
|   C|    22|    32|
|   C|    64|    10|
|   D|    32|    23|
|   D|    62|    11|
|   D|    32|    13|
|   E|    63|    17|
+----+------+------+
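For reproducibility, this df can be created like so:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the same data as the table above
data = [
    ("A", 25, 13), ("A", 6, 14), ("A", 2, 11), ("A", 32, 17),
    ("B", 22, 19), ("B", 42, 10), ("B", 43, 32),
    ("C", 33, 12), ("C", 90, 21), ("C", 12, 32), ("C", 22, 32), ("C", 64, 10),
    ("D", 32, 23), ("D", 62, 11), ("D", 32, 13),
    ("E", 63, 17),
]
df = spark.createDataFrame(data, ["user", "value", "number"])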
I want to group the df per user and then iterate through the user groups, passing each one to a couple of functions that I have defined like below:
def first_function(df):
    ...  # operation on df
    return df

def second_function(df):
    ...  # operation on df
    return df

def third_function(df):
    ...  # operation on df
    return df
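In other words, for each user group I want to end up with something like this (user_df is just a hypothetical name here for a DataFrame holding only one user's rows):

user_df = third_function(second_function(first_function(user_df)))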
Based on this answer I'm aware I can extract a list of unique users like so:
from pyspark.sql import functions as F

# collect the distinct user values into a Python list
users = [user[0] for user in df.select("user").distinct().collect()]
# build a list of DataFrames, one filtered DataFrame per user
users_list = [df.filter(F.col("user") == user) for user in users]
But it is unclear to me how I can use this users_list to iterate through my original df per user group so that I can feed the groups to my functions. What is the best way to do this?
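For reference, the kind of loop I have in mind is sketched below, assuming each function takes and returns a per-user DataFrame (processed and result are names I'm introducing just for this sketch), though I don't know whether repeatedly filtering df like this is the idiomatic approach:

from functools import reduce

processed = []
for user_df in users_list:
    # apply the chained functions to each per-user DataFrame
    processed.append(third_function(second_function(first_function(user_df))))

# recombine the per-user results into a single DataFrame
result = reduce(lambda a, b: a.unionByName(b), processed)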