I have a large table with columns id, time_create, and state.
I want to convert it to a new table with columns id, date, and last_state.
This is easy to do with pandas:
import pandas as pd

df['time_create'] = pd.to_datetime(df['time_create'])
df = df.set_index('time_create')
df = df.sort_index()
# group by id, resample to daily bins, and keep the last row in each bin
df = df.groupby('id').resample('D').last().reset_index()
But it is hard to implement in PySpark.
Here is what I know so far:
The resample equivalent in PySpark is groupBy + window:
from pyspark.sql.functions import window, sum

grouped = df.groupBy('store_product_id', window("time_create", "1 day")).agg(sum("Production").alias('Sum Production'))
Here I group by store_product_id, resample by day, and calculate the sum.
Group by and take the first or last row:
refer to https://stackoverflow.com/a/35226857/1637673
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w = Window().partitionBy("store_product_id").orderBy(col("time_create").desc())

(df
 .withColumn("rn", row_number().over(w))
 .where(col("rn") == 1)
 .select("store_product_id", "time_create", "state"))
This groups by id and takes the last row ordered by time_create.
However, what I need is to group by id, resample by day, and then take the last row ordered by time_create.
I know this problem could probably be solved with a pandas UDF, as in Applying UDFs on GroupedData in PySpark (with functioning python example), for example along the lines of the sketch below.
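For illustration, the pandas UDF route I have in mind looks roughly like this. It is only a sketch: it assumes Spark 3.0+ (for applyInPandas), string id and state columns, and the helper name resample_last is just something I made up.

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DateType

out_schema = StructType([
    StructField('id', StringType()),
    StructField('date', DateType()),
    StructField('last_state', StringType()),
])

def resample_last(pdf: pd.DataFrame) -> pd.DataFrame:
    # run the pandas resample inside each id group
    pdf['time_create'] = pd.to_datetime(pdf['time_create'])
    out = (pdf.set_index('time_create').sort_index()
              .resample('D')['state'].last()
              .reset_index()
              .rename(columns={'state': 'last_state'}))
    out['date'] = out['time_create'].dt.date
    out['id'] = pdf['id'].iloc[0]
    return out[['id', 'date', 'last_state']]

result = df.groupby('id').applyInPandas(resample_last, schema=out_schema)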
But is there any way to do this with just PySpark?
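For concreteness, the kind of pure PySpark solution I am imagining is sketched below. It assumes the columns are id, time_create and state, and that daily resampling just means bucketing rows by calendar date (using to_date instead of the window function); I am not sure this is correct or efficient.

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number, to_date

# partition by id and the calendar day, then keep the newest row within each day
w = Window.partitionBy('id', to_date('time_create')).orderBy(col('time_create').desc())

result = (df
          .withColumn('rn', row_number().over(w))
          .where(col('rn') == 1)
          .select(col('id'),
                  to_date('time_create').alias('date'),
                  col('state').alias('last_state')))

One difference from the pandas version is that days with no rows will simply not appear in the output, whereas resample('D').last() emits them with NaN.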