
Hi, I want to transform this PySpark dataframe:

| player_id | stat_week  | moves | week_num |
|-----------|------------|-------|----------|
| 1         | 2022-06-13 | 1     | 24       |
| 1         | 2022-06-06 | 20    | 23       |
| 1         | 2022-06-20 | 0     | 25       |
| 2         | 2022-06-06 | 20    | 23       |
| 2         | 2022-06-13 | 0     | 24       |
| 2         | 2022-06-20 | 0     | 25       |
| 1         | 2022-05-30 | 10    | 22       |
| 1         | 2022-05-23 | 20    | 21       |
| 1         | 2022-05-16 | 20    | 20       |

into

| player_id | moves      | week_num   | group |
|-----------|------------|------------|-------|
| 1         | (20,20,10) | (20,21,22) | 1     |
| 1         | (20,1,0)   | (23,24,25) | 2     |
| 2         | (20,0,0)   | (23,24,25) | 1     |

How can I group the data into 3-week buckets and aggregate each bucket's values as a tuple?

Any help appreciated!

toby X

1 Answer


First of all, PySpark has no built-in equivalent of the Python tuple (an immutable container that can hold values of different data types); you would have to define your own data type for that. If you just want to collect the elements into an array-like column, you can use ArrayType.
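
As a side note, if you really do need values of mixed types in one tuple-like container, a struct column is the closest built-in analogue. A minimal sketch, reusing the columns from your example (the column name week_moves is made up for illustration):

from pyspark.sql import functions as F

# A struct is positional and may mix data types, much like a tuple,
# although its fields are named and it is not a Python tuple.
df_struct = df.withColumn('week_moves', F.struct('week_num', 'moves'))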

You can create a week_num reference dataframe and then join back to your dataframe to do the grouping:

# Distinct week numbers in ascending order
week_lst = [int(row['week_num']) for row in df.select('week_num').distinct().orderBy('week_num').collect()]
# Assign every 3 consecutive weeks the same group id: 0,0,0,1,1,1,...
group_lst = [j // 3 for j in range(len(week_lst))]
week_reference = spark.createDataFrame(
    data=[(week, group) for week, group in zip(week_lst, group_lst)],
    schema=['week_num', 'group']
)
# Attach the group id to every row of the original dataframe
df = df.join(week_reference, on='week_num', how='left')
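
With the sample data above, the reference dataframe maps the six distinct weeks into two groups:

week_reference.show()

+--------+-----+
|week_num|group|
+--------+-----+
|      20|    0|
|      21|    0|
|      22|    0|
|      23|    1|
|      24|    1|
|      25|    1|
+--------+-----+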

Then you can do your grouping by:

from pyspark.sql.functions import collect_list

# Collect the moves and week numbers of each 3-week group into arrays
df.groupby('player_id', 'group')\
    .agg(collect_list('moves').alias('moves'),
         collect_list('week_num').alias('week_num'))
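
On the sample data this produces something like the output below. Note that the group ids start at 0 rather than 1, and that the order of elements inside each array is not guaranteed (see the note that follows):

+---------+-----+------------+------------+
|player_id|group|       moves|    week_num|
+---------+-----+------------+------------+
|        1|    0|[20, 20, 10]|[20, 21, 22]|
|        1|    1|  [20, 1, 0]|[23, 24, 25]|
|        2|    1|  [20, 0, 0]|[23, 24, 25]|
+---------+-----+------------+------------+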

If you want the elements inside each array to preserve the week order, you can check this post: collect_list by preserving order based on another variable
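
For completeness, here is a minimal sketch of the sort-by-struct pattern described in that post, applied to the df that already has the group column attached (the intermediate column name pairs is made up for illustration):

from pyspark.sql import functions as F

# Collect (week_num, moves) pairs, sort them by week_num (the first
# struct field), then split the sorted pairs back into two array columns.
result = (
    df.groupby('player_id', 'group')
      .agg(F.sort_array(F.collect_list(F.struct('week_num', 'moves'))).alias('pairs'))
      .withColumn('week_num', F.col('pairs.week_num'))
      .withColumn('moves', F.col('pairs.moves'))
      .drop('pairs')
)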

Jonathan Lam