
I have a large PySpark DataFrame with a structure similar to this:

```
city    store  machine_id  numeric_value  time
London  A      1           x              01/01/2021 14:15:00
London  A      2           y              01/01/2021 14:17:00
NY      B      9           z              01/01/2021 16:12:00
London  A      1           w              01/01/2021 14:20:00
London  A      2           q              01/01/2021 14:24:00
...
```

I would like to split the data into time windows (of 10 minutes, for example), calculate some statistics (mean, variance, number of distinct values, and other custom functions) per machine_id, and output a histogram of each statistic per combination of city and store. For example, for each city-store combination, a histogram of the variance of "numeric_value" in a 10-minute time window.

So far I have used groupBy to get the data grouped by the columns I need:

```python
from pyspark.sql import functions as F

interval_window = F.window('time', '10 minutes')
grouped_df = df.groupBy('city', 'store', 'machine_id', interval_window)
```
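For the built-in statistics, agg works directly on this grouped frame. A minimal sketch, assuming the column names from the sample (countDistinct gives the distinct-value count):

```python
from pyspark.sql import functions as F

# Built-in aggregations per (city, store, machine_id, window) group
stats_df = grouped_df.agg(
    F.mean('numeric_value').alias('mean'),
    F.variance('numeric_value').alias('variance'),
    F.countDistinct('numeric_value').alias('n_distinct'),
)
```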

From here I applied some pyspark.sql.functions (like variance, mean, ...) using agg, as sketched above, but I would like to know how to apply a custom function on a GroupedData object, and how I can output a histogram of the results per city and store. I don't think I can convert it into a pandas DataFrame, as this DataFrame is very large and won't fit on the driver.
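One way to run an arbitrary Python function per group without collecting anything to the driver is GroupedData.applyInPandas (Spark 3.0+): each group is handed to the function as a pandas DataFrame on an executor. A sketch under the column names from the sample; extracting the window start first keeps the grouping key a plain timestamp, and my_custom_stat is a hypothetical placeholder:

```python
import pandas as pd
from pyspark.sql import functions as F

# Materialize the window start as a plain timestamp column so the
# grouping key is a simple type rather than a struct.
windowed = df.withColumn(
    'win_start', F.window('time', '10 minutes').getField('start'))

def group_stats(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one (city, store, machine_id, window) group.
    v = pdf['numeric_value']
    return pd.DataFrame([{
        'city': pdf['city'].iloc[0],
        'store': pdf['store'].iloc[0],
        'machine_id': pdf['machine_id'].iloc[0],
        'win_start': pdf['win_start'].iloc[0],
        'variance': v.var(),
        'n_distinct': v.nunique(),
        # Hypothetical custom statistic -- replace with your own logic.
        'my_custom_stat': float(v.max() - v.min()),
    }])

schema = ('city string, store string, machine_id long, win_start timestamp, '
          'variance double, n_distinct long, my_custom_stat double')

stats = (windowed
         .groupBy('city', 'store', 'machine_id', 'win_start')
         .applyInPandas(group_stats, schema))
```

Because each group is processed on the executors, the full DataFrame never has to fit on the driver.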

I'm a beginner in Spark, so if I'm not using the correct objects/functions, please let me know.

Thanks!

  • Does this answer your question? [Applying UDFs on GroupedData in PySpark (with functioning python example)](https://stackoverflow.com/questions/40006395/applying-udfs-on-groupeddata-in-pyspark-with-functioning-python-example) – ggordon Apr 25 '21 at 13:35
  • I managed to run my own custom function with the agg function; it looks like it's working. Now I'm trying to group the data again by city and store only and output a histogram of the function's output, but I haven't found a solution for a histogram on GroupedData (see the bucketing sketch below). – Koral Haham Apr 26 '21 at 07:41
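On the histogram part: rather than calling a histogram routine on GroupedData, one distributed approach is to bucketize the per-window statistic and count rows per (city, store, bucket); only the small bucket counts ever reach the driver. A sketch, reusing the stats frame from the earlier example (bucket_width is an arbitrary assumption to tune to the scale of your statistic):

```python
from pyspark.sql import functions as F

bucket_width = 1.0  # assumed bin size -- adjust for your data

# Histogram of the per-window variance, one set of bucket counts
# per (city, store) combination, computed entirely on the executors.
hist = (stats
        .withColumn('bucket', F.floor(F.col('variance') / bucket_width))
        .groupBy('city', 'store', 'bucket')
        .count()
        .orderBy('city', 'store', 'bucket'))

hist.show()
```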

0 Answers