I have a dataframe as follow:
from pyspark.sql import functions as f
from pyspark.sql.window import Window
df = spark.createDataFrame([
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:30:57.000", "Username": "user1", "Region": "US"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:31:57.014", "Username": "user2", "Region": "US"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:32:57.914", "Username": "user1", "Region": "MX"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:35:57.914", "Username": "user2", "Region": "CA"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:33:57.914", "Username": "user1", "Region": "UK"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:34:57.914", "Username": "user1", "Region": "GR"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:36:57.914", "Username": "user2", "Region": "IR"}])
w = Window.partitionBy().orderBy("groupId","Username").orderBy("Username","ts")
df2 = df.withColumn("prev_region", f.lag(df.Region).over(w))
Day | Region | Username | groupId | ts |
---|---|---|---|---|
2021-01-27 | US | user1 | A | 2021-01-27 08:30:57.000 |
2021-01-27 | MX | user1 | A | 2021-01-27 08:32:57.914 |
2021-01-27 | UK | user1 | A | 2021-01-27 08:33:57.914 |
2021-01-27 | GR | user1 | A | 2021-01-27 08:34:57.914 |
2021-01-27 | US | user2 | A | 2021-01-27 08:31:57.014 |
2021-01-27 | CA | user2 | A | 2021-01-27 08:35:57.914 |
2021-01-27 | IR | user2 | A | 2021-01-27 08:36:57.914 |
And, I want to know what was previous region of users so I used lag function.
Day | Region | Username | groupId | ts | prev_region |
---|---|---|---|---|---|
2021-01-27 | US | user1 | A | 2021-01-27 08:30:57.000 | null |
2021-01-27 | MX | user1 | A | 2021-01-27 08:32:57.914 | US |
2021-01-27 | UK | user1 | A | 2021-01-27 08:33:57.914 | MX |
2021-01-27 | GR | user1 | A | 2021-01-27 08:34:57.914 | UK |
2021-01-27 | US | user2 | A | 2021-01-27 08:31:57.014 | GR |
2021-01-27 | CA | user2 | A | 2021-01-27 08:35:57.914 | US |
2021-01-27 | IR | user2 | A | 2021-01-27 08:36:57.914 | CA |
As you see the value of "prev region" column in the first record of user2 expected to be "null"; however, it is wrong value. I would be thankful if you can show me how to fix it.