This is my DataFrame in PySpark:
utc_timestamp data feed
2015-10-13 11:00:00+00:00 1 A
2015-10-13 12:00:00+00:00 5 A
2015-10-13 13:00:00+00:00 6 A
2015-10-13 14:00:00+00:00 10 B
2015-10-13 15:00:00+00:00 11 B
The values of data are cumulative.
I want to get this result (differences between consecutive rows, grouped by feed):
utc_timestamp data feed
2015-10-13 11:00:00+00:00 1 A
2015-10-13 12:00:00+00:00 4 A
2015-10-13 13:00:00+00:00 1 A
2015-10-13 14:00:00+00:00 10 B
2015-10-13 15:00:00+00:00 1 B
In pandas I would do it this way:
df["data"] -= (df.groupby("feed")["data"].shift(fill_value=0))
How can I do the same thing in PySpark?
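I suspect the equivalent involves a window partitioned by feed and the lag function, along these lines (a rough sketch, not verified):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("2015-10-13 11:00:00+00:00", 1, "A"),
        ("2015-10-13 12:00:00+00:00", 5, "A"),
        ("2015-10-13 13:00:00+00:00", 6, "A"),
        ("2015-10-13 14:00:00+00:00", 10, "B"),
        ("2015-10-13 15:00:00+00:00", 11, "B"),
    ],
    ["utc_timestamp", "data", "feed"],
)

# For each feed, subtract the previous row's cumulative value;
# lag's default of 0 keeps the first row of each feed unchanged.
w = Window.partitionBy("feed").orderBy("utc_timestamp")
result = df.withColumn("data", F.col("data") - F.lag("data", 1, 0).over(w))
result.show()
```

Is this the idiomatic way to do it, or is there something better?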