
I have built an RDD from a file where each element in the RDD is a section of the file separated by a delimiter.

val inputRDD1:RDD[(String,Long)] = myUtilities.paragraphFile(spark,path1)
                                              .coalesce(100*spark.defaultParallelism) 
                                              .zipWithIndex() // RDD[(String, Long)]
                                              .filter(f => f._2 != 0)

The reason I do the last operation above (the filter) is to remove the first element, which has index 0.

Is there a better way to remove the first element than checking each element's index as done above?

Thanks!


1 Answer


One possibility is to use RDD.mapPartitionsWithIndex and remove the first element from the iterator of partition 0:

val inputRDD = myUtilities
                .paragraphFile(spark,path1)
                .coalesce(100*spark.defaultParallelism) 
                .mapPartitionsWithIndex(
                   (index, it) => if (index == 0) it.drop(1) else it,
                    preservesPartitioning = true
                 )

This way, you only ever advance a single item on the first iterator, while all others remain untouched. Is this more efficient? Probably. Either way, I'd test both versions to see which one performs better.
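
To make the comparison concrete, here is a minimal, self-contained sketch; the sample data and the SparkContext sc are assumptions standing in for myUtilities.paragraphFile. Both approaches should produce the original elements minus the first one:

import org.apache.spark.rdd.RDD

// Hypothetical sample data; `sc` is an assumed live SparkContext.
val data: RDD[String] = sc.parallelize(Seq("header", "a", "b", "c"), numSlices = 4)

// Approach 1: zipWithIndex + filter. Every element is paired with an
// index and checked against it.
val viaFilter: RDD[String] =
  data.zipWithIndex()
      .filter { case (_, index) => index != 0 }
      .map { case (value, _) => value }

// Approach 2: mapPartitionsWithIndex. Only partition 0's iterator is
// advanced one step; all other partitions pass through untouched.
val viaDrop: RDD[String] =
  data.mapPartitionsWithIndex(
    (index, it) => if (index == 0) it.drop(1) else it,
    preservesPartitioning = true
  )

println(viaFilter.collect().mkString(", ")) // a, b, c
println(viaDrop.collect().mkString(", "))   // a, b, c

To actually time the two, you could wrap each collect() in System.nanoTime calls, though the results will depend heavily on file size and partition count.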

  • Thanks Yuval. This does seem to perform better by approx 10 seconds. I'll need to test this out with varying file sizes. – user1384205 Nov 02 '16 at 11:13