
Hi, I want to transform this PySpark dataframe:

| player_id | stat_week  | moves | week_num |
|-----------|------------|-------|----------|
| 1         | 2022-06-13 | 1     | 24       |
| 1         | 2022-06-06 | 20    | 23       |
| 1         | 2022-06-20 | 0     | 25       |
| 2         | 2022-06-06 | 20    | 23       |
| 2         | 2022-06-13 | 0     | 24       |
| 2         | 2022-06-20 | 0     | 25       |
| 1         | 2022-05-30 | 10    | 22       |
| 1         | 2022-05-23 | 20    | 21       |
| 1         | 2022-05-16 | 20    | 20       |

into

| player_id | moves      | week_num   | group |
|-----------|------------|------------|-------|
| 1         | (20,20,10) | (20,21,22) | 1     |
| 1         | (20,1,0)   | (23,24,25) | 2     |
| 2         | (20,0,0)   | (23,24,25) | 1     |

How can I group the data into 3-week buckets and aggregate each bucket's values as a tuple?

Any help appreciated!

toby X

1 Answer


First of all, PySpark has no built-in equivalent of the Python tuple (an immutable container that can hold values of different data types); you would have to define your own data type for that. If you just want to collect the elements into an array-like column, you can use ArrayType.
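
As a side note, if you really do need values of mixed types in one tuple-like container, a struct column is the closest built-in analogue. A minimal sketch, reusing the columns from your example (the column name week_moves is made up for illustration):

from pyspark.sql import functions as F

# A struct is positional and may mix data types, much like a tuple,
# although its fields are named and it is not a Python tuple.
df_struct = df.withColumn('week_moves', F.struct('week_num', 'moves'))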

You can create a week_num reference dataframe and then join back to your dataframe to do the grouping:

# Distinct week numbers in ascending order
week_lst = [int(row['week_num']) for row in df.select('week_num').distinct().orderBy('week_num').collect()]
# Assign every 3 consecutive weeks the same group id: 0,0,0,1,1,1,...
group_lst = [j // 3 for j in range(len(week_lst))]
week_reference = spark.createDataFrame(
    data=[(week, group) for week, group in zip(week_lst, group_lst)],
    schema=['week_num', 'group']
)
# Attach the group id to every row of the original dataframe
df = df.join(week_reference, on='week_num', how='left')
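
With the sample data above, the reference dataframe maps the six distinct weeks into two groups:

week_reference.show()

+--------+-----+
|week_num|group|
+--------+-----+
|      20|    0|
|      21|    0|
|      22|    0|
|      23|    1|
|      24|    1|
|      25|    1|
+--------+-----+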

Then you can do your grouping by:

from pyspark.sql.functions import collect_list

# Collect the moves and week numbers of each 3-week group into arrays
df.groupby('player_id', 'group')\
    .agg(collect_list('moves').alias('moves'),
         collect_list('week_num').alias('week_num'))
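
On the sample data this produces something like the output below. Note that the group ids start at 0 rather than 1, and that the order of elements inside each array is not guaranteed (see the note that follows):

+---------+-----+------------+------------+
|player_id|group|       moves|    week_num|
+---------+-----+------------+------------+
|        1|    0|[20, 20, 10]|[20, 21, 22]|
|        1|    1|  [20, 1, 0]|[23, 24, 25]|
|        2|    1|  [20, 0, 0]|[23, 24, 25]|
+---------+-----+------------+------------+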

If you want the elements inside each array to preserve the week order, you can check this post: collect_list by preserving order based on another variable
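
For completeness, here is a minimal sketch of the sort-by-struct pattern described in that post, applied to the df that already has the group column attached (the intermediate column name pairs is made up for illustration):

from pyspark.sql import functions as F

# Collect (week_num, moves) pairs, sort them by week_num (the first
# struct field), then split the sorted pairs back into two array columns.
result = (
    df.groupby('player_id', 'group')
      .agg(F.sort_array(F.collect_list(F.struct('week_num', 'moves'))).alias('pairs'))
      .withColumn('week_num', F.col('pairs.week_num'))
      .withColumn('moves', F.col('pairs.moves'))
      .drop('pairs')
)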

Jonathan Lam