
I have built an RDD from a file where each element in the RDD is a section of the file separated by a delimiter.

val inputRDD1:RDD[(String,Long)] = myUtilities.paragraphFile(spark,path1)
                                              .coalesce(100*spark.defaultParallelism) 
                                              .zipWithIndex() // RDD[(String, Long)]
                                              .filter(f => f._2 != 0)

The reason I do the last operation above (the filter) is to remove the first element, which has index 0.

Is there a better way to remove the first element than checking each element's index as done above?

Thanks!


1 Answer


One possibility is to use RDD.mapPartitionsWithIndex and remove the first element from the iterator of partition 0:

val inputRDD = myUtilities
                .paragraphFile(spark,path1)
                .coalesce(100*spark.defaultParallelism) 
                .mapPartitionsWithIndex(
                   (index, it) => if (index == 0) it.drop(1) else it,
                    preservesPartitioning = true
                 )

This way, you only ever advance a single item on the first iterator, while all others remain untouched. Is this more efficient? Probably. Either way, I'd test both versions to see which one performs better.
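
To make the comparison concrete, here is a minimal, self-contained sketch; the sample data and the SparkContext sc are assumptions standing in for myUtilities.paragraphFile. Both approaches should produce the original elements minus the first one:

import org.apache.spark.rdd.RDD

// Hypothetical sample data; `sc` is an assumed live SparkContext.
val data: RDD[String] = sc.parallelize(Seq("header", "a", "b", "c"), numSlices = 4)

// Approach 1: zipWithIndex + filter. Every element is paired with an
// index and checked against it.
val viaFilter: RDD[String] =
  data.zipWithIndex()
      .filter { case (_, index) => index != 0 }
      .map { case (value, _) => value }

// Approach 2: mapPartitionsWithIndex. Only partition 0's iterator is
// advanced one step; all other partitions pass through untouched.
val viaDrop: RDD[String] =
  data.mapPartitionsWithIndex(
    (index, it) => if (index == 0) it.drop(1) else it,
    preservesPartitioning = true
  )

println(viaFilter.collect().mkString(", ")) // a, b, c
println(viaDrop.collect().mkString(", "))   // a, b, c

To actually time the two, you could wrap each collect() in System.nanoTime calls, though the results will depend heavily on file size and partition count.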

  • Thanks Yuval. This does seem to perform better by approx 10 seconds. I'll need to test this out with varying file sizes. – user1384205 Nov 02 '16 at 11:13