Question: Is there any efficient way to chunk an RDD holding a big list into several lists without performing a collect?
Instead of collecting the big list and splitting it into several lists on the driver, you can break the big RDD into multiple small RDDs for further processing...
Collecting a big RDD is not a good idea. But if you want to divide a big RDD into small ones, i.e. an Array[RDD], you can use the approach below. It was written in Scala; you can translate it into Python using the example here.
Python docs here.
You can go with randomSplit; see the docs here.
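As a quick sketch of the API shape (assuming an existing SparkContext sc; the weights are normalized if they don't sum to 1, and the seed is optional):

// split an RDD into two roughly-equal halves, with a fixed seed for reproducibility
val halves = sc.parallelize(1 to 100).randomSplit(Array(0.5, 0.5), seed = 42L)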
You can see how it is implemented in the Spark source, which is available on GitHub:
/**
 * Randomly splits this RDD with the provided weights.
 *
 * @param weights weights for splits, will be normalized if they don't sum to 1
 * @param seed random seed
 *
 * @return split RDDs in an array
 */
def randomSplit(
    weights: Array[Double],
    seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  require(weights.forall(_ >= 0),
    s"Weights must be nonnegative, but got ${weights.mkString("[", ",", "]")}")
  require(weights.sum > 0,
    s"Sum of weights must be positive, but got ${weights.mkString("[", ",", "]")}")
  withScope {
    val sum = weights.sum
    val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    normalizedCumWeights.sliding(2).map { x =>
      randomSampleWithRange(x(0), x(1), seed)
    }.toArray
  }
}
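To see what the normalization step computes, here is a quick plain-Scala sketch (no Spark needed): the weights are scaled to sum to 1, scanLeft turns them into cumulative boundaries over [0, 1], and sliding(2) yields the sub-range of that interval each split samples from.

val weights = Array(1.0, 1.0, 1.0)
val cum = weights.map(_ / weights.sum).scanLeft(0.0d)(_ + _)
// cum: Array(0.0, 0.333..., 0.666..., 1.0)
val ranges = cum.sliding(2).map(x => (x(0), x(1))).toArray
// ranges: Array((0.0, 0.333...), (0.333..., 0.666...), (0.666..., 1.0))

Because every split samples with the same seed over disjoint ranges, the splits are disjoint and together cover the whole RDD.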
Scala example (I'm not comfortable with Python :-)); for Python, see the docs here:
import org.apache.log4j.Level
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

/**
 * Created by Ram Ghadiyaram
 */
object RDDRandomSplitExample {

  org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("RDDRandomSplitExample")
      .getOrCreate()

    val y = spark.sparkContext.parallelize(1 to 100)
    // break/split the big RDD into small RDDs
    val splits: Array[RDD[Int]] =
      y.randomSplit(Array(0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1))
    splits.foreach(x => println("number of records in each rdd " + x.count))
  }
}
Result:
number of records in each rdd 9
number of records in each rdd 9
number of records in each rdd 8
number of records in each rdd 7
number of records in each rdd 9
number of records in each rdd 17
number of records in each rdd 11
number of records in each rdd 9
number of records in each rdd 7
number of records in each rdd 6
number of records in each rdd 8
Conclusion:
Each RDD ends up with roughly the same number of elements (the split is random, so the counts only approximate 1/11 of the 100 records each).
You can then process each small RDD without ever collecting the original big RDD, for instance as sketched below.
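A minimal sketch of that follow-up processing, reusing the splits array from the example above (sum is just a stand-in for whatever per-split work you need):

splits.zipWithIndex.foreach { case (rdd, i) =>
  // each split is a regular RDD, so any transformation/action applies
  println(s"split $i sum = ${rdd.sum()}")
}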