Below is one way to do it.
scala> val rdd = sc.parallelize('a' to 'z') // a sample dataset of letters
rdd: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[109] at parallelize at <console>:24
scala> rdd.zipWithIndex(). // pair each element with its index
| groupBy(_._2 / 3). // _._2 is the index; dividing by the group size (3 here) gives each element its group key
| map(_._2.map(_._1)). // _._2 is one group's contents; keep the original elements and drop the indices
| foreach(println)
List(g, h, i)
List(v, w, x)
List(p, q, r)
List(m, n, o)
List(a, b, c)
List(y, z)
List(d, e, f)
List(j, k, l)
List(s, t, u)
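Note that the groups print in arbitrary order because foreach runs on the executors in parallel. If the original order matters, here is a minimal sketch of one way to restore it, assuming the same rdd and group size as above (sortByKey orders the groups; the inner sortBy orders elements within each group):

val ordered = rdd.zipWithIndex()
  .groupBy(_._2 / 3)                       // key = index / group size
  .sortByKey()                             // put the groups in order
  .map(_._2.toList.sortBy(_._2).map(_._1)) // order within each group, drop indices

ordered.collect().foreach(println) // List(a, b, c), List(d, e, f), ..., List(y, z)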
As per the discussion at Partition RDD into tuples of length n, this solution involves a shuffle (due to the groupBy), so it may not be optimal.
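If the shuffle is a concern, a shuffle-free variant (my own sketch, not from the linked discussion) is to group within each partition using mapPartitions with Scala's Iterator.grouped. The trade-off: the last group in every partition can hold fewer than n elements, so you may get several short groups instead of at most one.

val n = 3
// Group elements locally inside each partition; no shuffle is triggered.
val local = rdd.mapPartitions(_.grouped(n).map(_.toList))
local.collect().foreach(println)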
I'd also suggest mentioning your use case, to invite more pertinent answers.