What's the most efficient way to create a column showing the second session given the following dataframe:
from pyspark import SparkContext
from pyspark.sql import HiveContext, Window
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("u1", "g1", 0),
("u2", "g2", 1),
("u1", "g2", 2),
("u1", "g3", 3),
], ["UserID", "GameID", "Time"])
df.show()
+------+------+----+
|UserID|GameID|Time|
+------+------+----+
|    u1|    g1|   0|
|    u2|    g2|   1|
|    u1|    g2|   2|
|    u1|    g3|   3|
+------+------+----+
Desired output
I would also like to keep the time of the first game as a column.
+------+-------+-----+-----+
|UserID|MinTime|Game1|Game2|
+------+-------+-----+-----+
|    u1|      0|   g1|   g2|
|    u1|      2|   g2|   g3|
+------+-------+-----+-----+
I was thinking of using a window partitioned by UserID, ordered by Time, with rowsBetween(0, 1), but I ran into issues getting the next row's GameID out as a column.
Using Spark 1.6 but open to 2.0 solutions.