What's the most efficient way to create a column showing each user's next game (the second game of each consecutive pair), given the following DataFrame:

from pyspark import SparkContext
from pyspark.sql import HiveContext, Window
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = HiveContext(sc)  # HiveContext is required for window functions in Spark 1.6

df = sqlContext.createDataFrame([
    ("u1", "g1", 0),
    ("u2", "g2", 1),
    ("u1", "g2", 2),
    ("u1", "g3", 3),
], ["UserID", "GameID", "Time"])

df.show()

+------+------+----+
|UserID|GameID|Time|
+------+------+----+
|    u1|    g1|   0|
|    u2|    g2|   1|
|    u1|    g2|   2|
|    u1|    g3|   3|
+------+------+----+

Desired output

I would also like to keep the time of the first game as a column.

+------+-------+-----+-----+
|UserID|MinTime|Game1|Game2|
+------+-------+-----+-----+
|    u1|      0|   g1|   g2|
|    u1|      2|   g2|   g3|
+------+-------+-----+-----+

I was thinking of using a window partitioned on UserID and ordered by Time with rowsBetween(0, 1), but ran into issues.

Using Spark 1.6 but open to 2.0 solutions.
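
For context, a minimal sketch of the rowsBetween idea (my illustration, not code from the question; w_pair and PairEndTime are made-up names): the frame covers each row plus the following one within a user's partition, but a frame only feeds aggregates, so extracting the next GameID from it is clumsy; lead() in the answer below is the direct tool.

# Illustration only: frame of (current row, next row) per user, ordered by Time
w_pair = (Window.partitionBy("UserID")
          .orderBy("Time")
          .rowsBetween(0, 1))

# An aggregate over that frame yields a summary of the pair, e.g. the later Time,
# but not the next GameID directly
df.select("UserID", "GameID",
          F.max("Time").over(w_pair).alias("PairEndTime")).show()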

Possible duplicate of: http://stackoverflow.com/questions/34295642/spark-add-new-column-to-dataframe-with-value-from-previous-row – Kamil Sindi Oct 16 '16 at 15:17

1 Answer

w = Window.partitionBy("UserID").orderBy("Time")  # order each user's games by Time

(df
 .select("UserID",
         "Time",                                    # time of the first game in the pair
         F.col("GameID").alias("Game1"),
         F.lead("GameID").over(w).alias("Game2"))   # next game for the same user
 .na.drop(subset=["Game2"])                         # drop each user's last game (no next game)
).show()

+------+----+-----+-----+
|UserID|Time|Game1|Game2|
+------+----+-----+-----+
|    u1|   0|   g1|   g2|
|    u1|   2|   g2|   g3|
+------+----+-----+-----+
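
Since the question is open to 2.0 solutions, here is a minimal sketch of the same approach on Spark 2.x (assuming SparkSession replaces HiveContext; the window and lead() logic is unchanged, and Time is aliased to MinTime to match the desired output):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").getOrCreate()

df = spark.createDataFrame([
    ("u1", "g1", 0),
    ("u2", "g2", 1),
    ("u1", "g2", 2),
    ("u1", "g3", 3),
], ["UserID", "GameID", "Time"])

w = Window.partitionBy("UserID").orderBy("Time")

(df
 .select("UserID",
         F.col("Time").alias("MinTime"),            # time of the first game in the pair
         F.col("GameID").alias("Game1"),
         F.lead("GameID").over(w).alias("Game2"))
 .na.drop(subset=["Game2"])
 .show())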