In PySpark you can do this with window functions:
First, let's create the DataFrame. Note that you could also load it directly from a CSV file (a sketch of that is shown right after this snippet):
df = spark.createDataFrame(
    sc.parallelize([
        [1, 20, 30, 40, 1, 1],
        [1, 20, 30, 40, 2, 1],
        [1, 20, 30, 40, 3, 1],
        [1, 20, 30, 40, 4, 1],
        [1, 20, 30, 40, 45, 2],
        [1, 20, 30, 40, 1, 2],
        [1, 30, 30, 40, 2, 1],
        [1, 30, 30, 40, 3, 1],
        [1, 30, 30, 40, 4, 1],
        [1, 30, 30, 40, 5, 1],
    ]),
    ["v_id", "d_id", "ip", "l_id", "delta", "event_id"]
)
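As a side note, a minimal sketch of the CSV route could look like this, assuming a hypothetical file events.csv with a header row and these same six columns (the types are inferred here, but you could also pass an explicit schema):

df = spark.read.csv(
    "events.csv",      # hypothetical path, not part of the original example
    header=True,       # first line holds the column names
    inferSchema=True   # let Spark infer the column types
)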
Your table has an implicit ordering, so we need to create a monotonically increasing ID to keep track of it and make sure the rows don't get shuffled around (the generated IDs are not guaranteed to be consecutive, they only need to be increasing):
import pyspark.sql.functions as psf

df = df.withColumn(
    "rn",
    psf.monotonically_increasing_id()
)
+----+----+---+----+-----+--------+----------+
|v_id|d_id| ip|l_id|delta|event_id| rn|
+----+----+---+----+-----+--------+----------+
| 1| 20| 30| 40| 1| 1| 0|
| 1| 20| 30| 40| 2| 1| 1|
| 1| 20| 30| 40| 3| 1| 2|
| 1| 20| 30| 40| 4| 1| 3|
| 1| 20| 30| 40| 45| 2| 4|
| 1| 20| 30| 40| 1| 2|8589934592|
| 1| 30| 30| 40| 2| 1|8589934593|
| 1| 30| 30| 40| 3| 1|8589934594|
| 1| 30| 30| 40| 4| 1|8589934595|
| 1| 30| 30| 40| 5| 1|8589934596|
+----+----+---+----+-----+--------+----------+
Now to compute event_id and last_event_flag:
from pyspark.sql import Window

# forward window: cumulative sum in the original row order within each group
w1 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy("rn")
# reverse window: used to spot the last row of each group
w2 = Window.partitionBy("v_id", "d_id", "l_id", "ip").orderBy(psf.desc("rn"))

df.withColumn(
    "event_id",
    # each row with delta >= 40 starts a new event, so a running count of them (+1) gives the event number
    psf.sum((df.delta >= 40).cast("int")).over(w1) + 1
).withColumn(
    "last_event_flag",
    # the first row when sorted in reverse order is the last row of the group
    psf.row_number().over(w2) == 1
).drop("rn").show()
+----+----+---+----+-----+--------+---------------+
|v_id|d_id| ip|l_id|delta|event_id|last_event_flag|
+----+----+---+----+-----+--------+---------------+
| 1| 20| 30| 40| 1| 1| false|
| 1| 20| 30| 40| 2| 1| false|
| 1| 20| 30| 40| 3| 1| false|
| 1| 20| 30| 40| 4| 1| false|
| 1| 20| 30| 40| 45| 2| false|
| 1| 20| 30| 40| 1| 2| true|
| 1| 30| 30| 40| 2| 1| false|
| 1| 30| 30| 40| 3| 1| false|
| 1| 30| 30| 40| 4| 1| false|
| 1| 30| 30| 40| 5| 1| true|
+----+----+---+----+-----+--------+---------------+
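In practice you would probably assign the chained expression to a variable instead of (or before) calling show. Using a hypothetical name result for that DataFrame, keeping only the closing row of each (v_id, d_id, l_id, ip) group is then a one-liner:

# `result` is assumed to hold the DataFrame built by the chain above
last_events = result.filter("last_event_flag")
last_events.show()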