Below is one way to do it.
scala> val rdd = sc.parallelize('a' to 'z') // a sample dataset of letters
rdd: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[109] at parallelize at <console>:24
scala> rdd.zipWithIndex(). // pair each element with its index
| groupBy(_._2 / 3). // _._2 is the index; dividing by the group size (3 here) gives each element its group key
| map(_._2.map(_._1)). // _._2 is one group's contents; keep the original elements and drop the indices
| foreach(println)
List(g, h, i)
List(v, w, x)
List(p, q, r)
List(m, n, o)
List(a, b, c)
List(y, z)
List(d, e, f)
List(j, k, l)
List(s, t, u)
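Note that the groups print in arbitrary order because foreach runs on the executors in parallel. If the original order matters, here is a minimal sketch of one way to restore it, assuming the same rdd and group size as above (sortByKey orders the groups; the inner sortBy orders elements within each group):

val ordered = rdd.zipWithIndex()
  .groupBy(_._2 / 3)                       // key = index / group size
  .sortByKey()                             // put the groups in order
  .map(_._2.toList.sortBy(_._2).map(_._1)) // order within each group, drop indices

ordered.collect().foreach(println) // List(a, b, c), List(d, e, f), ..., List(y, z)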
As per the discussion at Partition RDD into tuples of length n, this solution involves a shuffle (due to the groupBy), so it may not be optimal.
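If the shuffle is a concern, a shuffle-free variant (my own sketch, not from the linked discussion) is to group within each partition using mapPartitions with Scala's Iterator.grouped. The trade-off: the last group in every partition can hold fewer than n elements, so you may get several short groups instead of at most one.

val n = 3
// Group elements locally inside each partition; no shuffle is triggered.
val local = rdd.mapPartitions(_.grouped(n).map(_.toList))
local.collect().foreach(println)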
I'd also suggest mentioning your use case, to invite more pertinent answers.