Sliding in Scala via mllib's RDDFunctions.sliding — two implementations, a bit fiddly, but here it is:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd1 = sc.parallelize(Seq(
  ("key1", "value1"),
  ("key2", "value2"),
  ("key3", "value3"),
  ("key4", "value4"),
  ("key5", "value5")
))
val rdd2 = rdd1.sliding(2)
val rdd3 = rdd2.map(x => (x(0), x(1)))
val rdd4 = rdd3.map(x => ((x._1._1, x._2._1),x._1._2, x._2._2))
rdd4.collect
Alternatively, the following, which is of course cleaner:
val rdd5 = rdd2.map{case Array(x,y) => ((x._1, y._1), x._2, y._2)}
rdd5.collect
Both return:
res70: Array[((String, String), String, String)] = Array(((key1,key2),value1,value2), ((key2,key3),value2,value3), ((key3,key4),value3,value4), ((key4,key5),value4,value5))
which I believe meets your needs — but this is Scala, not pyspark.
On Stack Overflow you can find statements that pyspark has no equivalent of sliding for RDDs unless you "roll your own"; see, for example, How to transform data with sliding window over time series data in Pyspark. However, I would advise using Data Frames with pyspark.sql.functions.lead() and pyspark.sql.functions.lag() — somewhat easier.