I am new to Spark.
Here is my code:
val Data = sc.parallelize(List(
  ("I", "India"),
  ("U", "USA"),
  ("W", "West")))
val DataArray = sc.broadcast(Data.collect)
val FinalData = DataArray.value
Here FinalData is of type Array[(String, String)], but I want the data in the form of RDD[(String, String)]. Can I convert FinalData to RDD[(String, String)]?
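For illustration, here is a minimal sketch of the conversion I am asking about. The SparkContext `sc` and the `backToRdd` name are assumptions; the Spark line is commented out because it needs a live cluster, and the plain Array above it mirrors the types involved:

```scala
// The broadcast value is a plain local Array[(String, String)].
val localData: Array[(String, String)] =
  Array(("I", "India"), ("U", "USA"), ("W", "West"))

// Assumed conversion back to an RDD (requires a live SparkContext `sc`):
// val backToRdd = sc.parallelize(localData)  // RDD[(String, String)]
// Note: parallelize would redistribute the data across the cluster,
// which undoes the point of broadcasting it in the first place.
```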
More detail:
I want to join two RDDs. To optimize the join (from a performance point of view) I am broadcasting the small RDD to all nodes in the cluster, so that less data is shuffled (and performance indirectly improves). So I am writing something like this:
//Big Data
val FirstRDD = sc.parallelize(List(****Data of first table****))
//Small Data
val SecondRDD = sc.parallelize(List(****Data of Second table****))
So definitely I will broadcast the small data set (i.e. SecondRDD):
val DataArray = sc.broadcast(SecondRDD.collect)
val FinalData = DataArray.value
val Join = FirstRDD.leftOuterJoin(FinalData)
// Here it gives an error: found Array, required RDD
That's why I am looking for an Array-to-RDD conversion.
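For reference, the kind of shuffle-free join I am trying to achieve can be sketched as a map-side lookup against the broadcast value. The names `small`, `smallBc`, `big`, and `joined` are illustrative, and the Spark lines are commented out because they need a live SparkContext `sc`; the plain-collection lines below mirror what each task would do:

```scala
// Small side as a local lookup map (what the broadcast would carry).
val small: Map[String, String] =
  Map("I" -> "India", "U" -> "USA", "W" -> "West")

// With Spark (assumed, needs a live SparkContext `sc`):
// val smallBc = sc.broadcast(small)
// val joined = FirstRDD.map { case (k, v) => (k, (v, smallBc.value.get(k))) }
// This reproduces leftOuterJoin semantics without a shuffle.

// The same per-record lookup on plain collections, for illustration:
val big = List(("I", "row1"), ("X", "row2"))
val joined = big.map { case (k, v) => (k, (v, small.get(k))) }
// "I" has a match -> Some("India"); "X" has none -> None
```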