I have a large PySpark DataFrame with a structure similar to this:
| city   | store | machine_id | numeric_value | time                |
|--------|-------|------------|---------------|---------------------|
| London | A     | 1          | x             | 01/01/2021 14:15:00 |
| London | A     | 2          | y             | 01/01/2021 14:17:00 |
| NY     | B     | 9          | z             | 01/01/2021 16:12:00 |
| London | A     | 1          | w             | 01/01/2021 14:20:00 |
| London | A     | 2          | q             | 01/01/2021 14:24:00 |
| ...    | ...   | ...        | ...           | ...                 |
I would like to split the data into time windows (of 10 minutes, for example), calculate some statistics (mean, variance, number of distinct values, and other custom functions) per machine_id, and output a histogram of each statistic per combination of city and store. For example, for each city_store combination, a histogram of the variance of "numeric_value" over 10-minute windows.
So far I have used groupBy to group the data by the columns I need:
from pyspark.sql import functions as F

# 10-minute tumbling windows on the "time" column
interval_window = F.window('time', '10 minutes')
grouped_df = df.groupBy('city', 'store', 'machine_id', interval_window)
From here I applied some pyspark.sql.functions (like variance, mean, ...) using agg, but I would like to know how to apply a custom function to a GroupedData object, and how I can output a histogram of the results per city and store. I don't think I can convert it to a pandas DataFrame, as this DataFrame is very large and won't fit in the driver's memory.
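For reference, the aggregation step I have so far looks roughly like this (a sketch using the built-in aggregate functions on grouped_df and F from above; the custom statistics are the part I don't know how to add):

# Built-in aggregates per (city, store, machine_id, window);
# custom statistics would ideally go alongside these.
stats_df = grouped_df.agg(
    F.mean('numeric_value').alias('mean_value'),
    F.variance('numeric_value').alias('var_value'),
    F.countDistinct('numeric_value').alias('n_distinct'),
)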
I'm a beginner in Spark, so if I'm not using the correct objects/functions, please let me know.
Thanks!