I have a massive PySpark DataFrame on which I have to perform a group by, but I am running into serious performance issues. I need to optimise the code, and I have been reading that reduceByKey is much more efficient.
This is an example of the DataFrame:
a = [
    ("Bob", 562, "Food", "12 May 2018"),
    ("Bob", 880, "Food", "01 June 2018"),
    ("Bob", 380, "Household", " 16 June 2018"),
    ("Sue", 85, "Household", " 16 July 2018"),
    ("Sue", 963, "Household", " 16 Sept 2018"),
]
df = spark.createDataFrame(a, ["Person", "Amount", "Budget", "Date"])
Output:
+------+------+---------+-------------+
|Person|Amount| Budget| Date|
+------+------+---------+-------------+
| Bob| 562| Food| 12 May 2018|
| Bob| 880| Food| 01 June 2018|
| Bob| 380|Household| 16 June 2018|
| Sue| 85|Household| 16 July 2018|
| Sue| 963|Household| 16 Sept 2018|
+------+------+---------+-------------+
I have implemented the following code; however, as mentioned above, the actual DataFrame is massive.
from pyspark.sql import functions as F

df_grouped = df.groupby("person").agg(
    F.collect_list(F.struct("Amount", "Budget", "Date")).alias("data")
)
Output:
+------+--------------------------------------------------------------------------------+
|person|data |
+------+--------------------------------------------------------------------------------+
|Sue |[[85,Household, 16 July 2018], [963,Household, 16 Sept 2018]] |
|Bob |[[562,Food,12 May 2018], [880,Food,01 June 2018], [380,Household, 16 June 2018]]|
+------+--------------------------------------------------------------------------------+
With the schema being:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Amount: long (nullable = true)
| | |-- Budget: string (nullable = true)
| | |-- Date: string (nullable = true)
I need to convert the group by into a reduceByKey such that I end up with the same schema as above.
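To show the direction I am considering, here is a rough sketch of how I think the reduceByKey version might look (the names pairs, reduced and df_reduced are just placeholders I made up, and I am not sure this is correct or that it will actually be faster than the group by):

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, ArrayType)

# Map each row to (Person, [(Amount, Budget, Date)]) and concatenate the
# per-person lists with reduceByKey.
pairs = df.rdd.map(lambda r: (r["Person"], [(r["Amount"], r["Budget"], r["Date"])]))
reduced = pairs.reduceByKey(lambda a, b: a + b)

# Rebuild a DataFrame with the same schema as the groupBy/collect_list output.
schema = StructType([
    StructField("person", StringType(), True),
    StructField("data", ArrayType(StructType([
        StructField("Amount", LongType(), True),
        StructField("Budget", StringType(), True),
        StructField("Date", StringType(), True),
    ])), True),
])
df_reduced = spark.createDataFrame(reduced, schema)

In particular, I am unsure whether concatenating Python lists inside reduceByKey on the RDD will actually be cheaper than the DataFrame group by, which is part of why I am asking.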