Sliding in Scala via mllib's RDDFunctions.sliding — two implementations, a bit fiddly, but here it is:
import org.apache.spark.mllib.rdd.RDDFunctions._
val rdd1 = sc.parallelize(Seq(
  ("key1", "value1"),
  ("key2", "value2"),
  ("key3", "value3"),
  ("key4", "value4"),
  ("key5", "value5")
))
val rdd2 = rdd1.sliding(2)
val rdd3 = rdd2.map(x => (x(0), x(1)))
val rdd4 = rdd3.map(x => ((x._1._1, x._2._1),x._1._2, x._2._2))
rdd4.collect
Alternatively, the following, which is of course cleaner:
val rdd5 = rdd2.map{case Array(x,y) => ((x._1, y._1), x._2, y._2)}
rdd5.collect
Both return:
res70: Array[((String, String), String, String)] = Array(((key1,key2),value1,value2), ((key2,key3),value2,value3), ((key3,key4),value3,value4), ((key4,key5),value4,value5))
which I believe meets your needs — but this is Scala, not pyspark.
On Stack Overflow you can find statements that pyspark has no equivalent of sliding for RDDs unless you "roll your own"; see, for example, How to transform data with sliding window over time series data in Pyspark. However, I would advise using Data Frames with pyspark.sql.functions.lead() and pyspark.sql.functions.lag() — somewhat easier.