
I have the following code in Python pandas, with a DataFrame df that contains the columns 'Connectivity_Tmstp', 'sensor_id' and 'duration_seconds':

df.set_index('Connectivity_Tmstp', inplace=True)                # timestamp column as index
grouper_hour = df.groupby([pd.Grouper(freq='H'), 'sensor_id'])  # group by hour and by sensor
result_stop_hour = grouper_hour['duration_seconds'].sum()       # sum durations within each group
kpi_df = result_stop_hour.to_frame()                            # Series -> DataFrame

This code sets the column 'Connectivity_Tmstp' as the index, then groups by hour and by sensor_id. Finally, it sums 'duration_seconds' within each group and puts the result in a new DataFrame like this:

Connectivity_Tmstp       |   sensor_id                |   duration_seconds          

2018-10-14 07:00:00      | 70b3d5e75e003fb7           |          60
                         | 70b3d5e75e004348           |          40
                         | 70b3d5e75e00435e           |          20
2018-11-02 07:00:00      | 70b3d5e75e0043b3           |          80
                         | 70b3d5e75e0043d7           |          10
                         | 70b3d5e75e0043da           |          60
2019-07-18 12:00:00      | 70b3d5e75e003fb8           |          40
                         | 70b3d5e75e00431c           |          10
                         | 70b3d5e75e0043c1           |          20
                         | 70b3d5e75e0043da           |          30 

Do you know how to do the same thing in PySpark?

Thanks for your answer.

Regards, Fab

1 Answer


Yes. You can use Window functions:

A good resource: Databricks - Introducing Window functions in Spark-SQL

If you have a granular time series and want to resample it to hourly buckets: PySpark: how to resample frequencies (a sketch of that approach follows the Window example below)

from pyspark.sql.window import Window
import pyspark.sql.functions as F

# running sum of duration_seconds per sensor, ordered by timestamp
w = Window.partitionBy("sensor_id").orderBy("Connectivity_Tmstp")

df = df.withColumn('sum', F.sum(F.col('duration_seconds')).over(w))
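
If the goal is to reproduce the pandas output exactly, i.e. one summed duration_seconds per hour and per sensor_id, a plain groupBy on an hourly bucket also works. Here is a minimal sketch, assuming df is the same Spark DataFrame and that Connectivity_Tmstp is already a timestamp column (cast it with F.to_timestamp first if it is a string); the name hourly is just for illustration:

import pyspark.sql.functions as F

hourly = (
    df.groupBy(
        F.window("Connectivity_Tmstp", "1 hour").alias("hour_window"),  # hourly bucket, like pd.Grouper(freq='H')
        "sensor_id",
    )
    .agg(F.sum("duration_seconds").alias("duration_seconds"))           # sum durations within each bucket/sensor
    .withColumn("Connectivity_Tmstp", F.col("hour_window.start"))       # keep the bucket start as the timestamp
    .drop("hour_window")
    .orderBy("Connectivity_Tmstp", "sensor_id")
)

hourly.show(truncate=False)

F.window returns a struct with start and end fields; keeping the start gives you the same hourly labels as pd.Grouper(freq='H') in the pandas result.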