
If I have an RDD that looks like:

var rdd = Array(4, 5, 7, 8, 9, 5, 3, 2, 1, 2, 13, 12, .....)

How do I select the elements located at a fixed stride, say every third element, so that I get:

var rdd1 = Array(4, 8, 3, 2, ...)
var rdd2 = Array(5, 9, 2, 13, ..)
var rdd3 = Array(7, 5, 1, 12, ..)

What I have tried is using zipWithIndex and then filtering on whether index % 3 is 0, 1, or 2:

val rdd1 = rdd.zipWithIndex.filter { case (_, index) => index % 3 == 0 }.map(_._1)  // likewise == 1 for rdd2 and == 2 for rdd3

(Pardon the exact syntax)

The approach works but is very inefficient for large RDDs. What are some other ways to do this in Scala? Thank you. Your help is much appreciated.
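
For reference, the attempt above can be written out as follows. This is only a sketch: the names `EveryNth` and `splitByStride` are made up for illustration. It indexes the RDD once and caches the result, on the assumption that part of the cost comes from recomputing `zipWithIndex` for each of the three filters. It is still a filter per output RDD, so it does not change the overall O(n) cost.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object EveryNth {
  // Hypothetical helper: split `rdd` into n RDDs, where result(k) holds
  // the elements whose position i satisfies i % n == k.
  def splitByStride[T: ClassTag](rdd: RDD[T], n: Int): Seq[RDD[T]] = {
    require(n > 0, "n must be positive")
    // zipWithIndex triggers a Spark job when the RDD has more than one
    // partition, so index once and cache instead of repeating it per filter.
    val indexed: RDD[(T, Long)] = rdd.zipWithIndex().cache()
    (0 until n).map { k =>
      indexed
        .filter { case (_, i) => i % n == k } // keep positions k, k + n, k + 2n, ...
        .map { case (value, _) => value }     // drop the index again
    }
  }
}
```

Assuming an existing SparkContext named `sc`, usage would look like `val Seq(rdd1, rdd2, rdd3) = EveryNth.splitByStride(sc.parallelize(Seq(4, 5, 7, 8, 9, 5, 3, 2, 1, 2, 13, 12)), 3)`.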

zero323
Kent Carlevi
  • That looks O(n) to me. How else can you possibly do it? – sarveshseri Mar 03 '17 at 19:31
  • @SarveshKumarSingh it is possible to do it in a single pass, as described here: http://stackoverflow.com/a/37956034/3669757 This method does involve trading memory for speed. Also, if *exactly* every 3rd (or nth) element is desired, and exactly correct across partition boundaries, then `zipWithIndex` will probably need to be involved – eje Mar 03 '17 at 21:44
  • Still O(n)? I thought he wanted it to be more efficient than that. Also, the given answer will not be suitable for "large" RDDs. – sarveshseri Mar 03 '17 at 22:59
  • One way I can think of is to use a custom RangePartitioner that partitions the data based on multiples of 3, then flatMap over each partition. – RBanerjee Mar 04 '17 at 07:48
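
The partitioner idea in the last comment might be sketched as below. This is an interpretation, not the commenter's code: it uses a plain custom `Partitioner` keyed on the element index rather than a `RangePartitioner`, and the names `StridePartitioner` and `StrideSplit.byStride` are hypothetical. It still relies on `zipWithIndex` for exact positions and costs one shuffle, so it remains O(n), and parallelism after the shuffle is limited to n partitions.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical partitioner: routes the element at position i to partition i % n.
class StridePartitioner(n: Int) extends Partitioner {
  require(n > 0, "n must be positive")
  override def numPartitions: Int = n
  // Keys are assumed to be the Long indices produced by zipWithIndex.
  override def getPartition(key: Any): Int = (key.asInstanceOf[Long] % n).toInt
}

object StrideSplit {
  // Returns an RDD with n partitions; partition k holds the elements at
  // positions k, k + n, k + 2n, ... in their original relative order.
  def byStride[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] =
    rdd.zipWithIndex()
      .map { case (value, i) => (i, value) }                         // key by position
      .repartitionAndSortWithinPartitions(new StridePartitioner(n)) // one shuffle, sorted by index
      .values                                                        // drop the index
}
```

Each output group is then a single partition, which can be inspected with `glom()` or processed with `mapPartitionsWithIndex`.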

0 Answers