
Assume I have an RDD containing (Int, Int) tuples. I wish to turn it into a Vector where the first Int in the tuple is the index and the second is the value.

Any idea how I can do that?

I updated my question and added my solution to clarify: my RDD is already reduced by key, and the number of keys is known. I want a vector in order to update a single accumulator instead of multiple accumulators.

Therefore my final solution was:

reducedStream.foreachRDD(rdd => rdd.foreach { case (x: Int, y: Int) =>
  // Build a one-hot contribution and add it to the single accumulator.
  // (foreach is an action; collect with a partial function is a lazy
  // transformation, so its side effects would never run.)
  val v = Array(0, 0, 0, 0)
  v(x) = y
  accumulator += new Vector(v)
})

This uses the Vector class from the accumulator example in the Spark documentation.
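The effect can be sketched locally in plain Scala (no Spark; the pair values below are made up): each (index, value) pair becomes a one-hot contribution, and summing the contributions yields the same vector a single accumulator would end up holding.

```scala
// Hypothetical reduced pairs: (index, value), indices unique, size known.
val pairs = List((0, 7), (2, 3), (3, 9))

// Each pair becomes a one-hot array, mirroring what each element
// contributes to the accumulator.
val contributions = pairs.map { case (x, y) =>
  val v = Array(0, 0, 0, 0)
  v(x) = y
  v
}

// Element-wise summation is what `accumulator +=` performs across tasks.
val total = contributions.reduce { (a, b) =>
  a.zip(b).map { case (p, q) => p + q }
}
// total.toVector == Vector(7, 0, 3, 9)
```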

Roman Nikitchenko
Noam Shaish

2 Answers

// numKeys is the known number of keys; pre-sizing keeps updated(k, v) in bounds
rdd.collectAsMap.foldLeft(Vector.fill(numKeys)(0)) { case (acc, (k, v)) => acc.updated(k, v) }

Turn the RDD into a Map. Then iterate over that, building a Vector as we go.

You could use just collect(), but if there are many repetitions of tuples with the same key, the result might not fit in memory.
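A local illustration in plain Scala (no Spark; the map below stands in for the result of collectAsMap, and the vector is pre-sized because updated cannot grow an empty Vector):

```scala
// Stand-in for rdd.collectAsMap on an RDD[(Int, Int)].
val collected = Map(0 -> 7, 2 -> 3, 3 -> 9)

// Pre-size the vector so updated(k, v) is always in bounds, then fold
// the map into it, placing each value at its key's index.
val size = 4
val vec = collected.foldLeft(Vector.fill(size)(0)) {
  case (acc, (k, v)) => acc.updated(k, v)
}
// vec == Vector(7, 0, 3, 9)
```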

The Archetypal Paul

One key thing: do you really need a Vector? A Map could be much more suitable.

  • If you really need a local Vector, you first have to use .collect() and then do the transformation into a Vector locally. Of course, you must have enough memory for this. The real problem, though, is finding a Vector that can be built efficiently from (index, value) pairs. As far as I know, Spark MLlib has the class org.apache.spark.mllib.linalg.Vectors, which can create a Vector from arrays of indices and values, and even from tuples. Under the hood it uses breeze.linalg, so it is probably the best starting point for you.

  • If you just need ordering, you can use .sortByKey(), since you already have an RDD[(K, V)]. This gives you an ordered stream, which does not strictly follow your intention but may suit you even better. You can then drop elements with the same key via .reduceByKey(), producing only the resulting elements.

  • Finally, if you really need a large vector, do .sortByKey() and then produce the real vector with a .flatMap() that maintains a counter, drops all but one element for the same index, and inserts the needed number of 'default' elements for missing indices.

Hope this is clear enough.
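The third option can be sketched locally in plain Scala (no Spark; the sample pairs and the target size of 5 are made up): after sorting, a flatMap with a counter drops duplicate indices and fills gaps with defaults.

```scala
// Sorted (index, value) pairs; index 2 appears twice and should be deduplicated.
val sorted = List((0, 7), (2, 3), (2, 5), (3, 9))
val size = 5

var next = 0 // counter: the next index the output vector still needs
val dense = sorted.flatMap { case (i, v) =>
  if (i < next) Nil // duplicate index: keep the first occurrence only
  else {
    val gap = List.fill(i - next)(0) // 'default' elements for missing indices
    next = i + 1
    gap :+ v
  }
} ++ List.fill(size - next)(0) // pad the tail with defaults
// dense == List(7, 0, 3, 9, 0)
```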

Roman Nikitchenko
  • These all seem more complex than the solution in my answer. In particular, sort + reduce or sort + flatmap are pretty inefficient, and collectAsMap avoids the issue of needing enough memory (all solutions need to assume the final Vector will fit in memory, of course) – The Archetypal Paul Dec 18 '14 at 22:25
  • OK, probably this is because of my specifics: I usually work with large RDDs, so your solution looks awful to me because of .updated(). I think @Noam has to re-think whether he really needs a Vector. – Roman Nikitchenko Dec 18 '14 at 22:29
  • Hmmm... for me 'fits in memory' is usually wrong suggestion ;-). – Roman Nikitchenko Dec 18 '14 at 22:30
  • Sure. But the OP wants a Vector, so it needs to fit :) – The Archetypal Paul Dec 18 '14 at 22:41
  • ^) The key word is 'needs': the flatMap solution gives a real structure with elements even in the correct places, even with low memory. OK, it cannot be randomly accessed. Too much here depends on whether the OP really needs a Vector. BTW, I must say your snippet was the first time I have seen a practical advantage from folding ) - never needed it before. – Roman Nikitchenko Dec 18 '14 at 22:46
  • folds can do anything (this is actually true: http://www.cs.nott.ac.uk/~gmh/fold.pdf) – The Archetypal Paul Dec 18 '14 at 22:48
  • http://stackoverflow.com/questions/25158780/difference-between-reduce-and-foldleft-fold-in-functional-programming-particula Ooops? – Roman Nikitchenko Dec 18 '14 at 22:49
  • Nope - because I'm folding on the map that results from collectAsMap. If I was foldLeft'ing on the RDD, then that would be bad (compared to reduce) – The Archetypal Paul Dec 18 '14 at 22:51
  • Exactly, I'm about 'everything' ;-). But thank you for this article. Pretty interesting. – Roman Nikitchenko Dec 18 '14 at 22:52
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/67317/discussion-between-roman-nikitchenko-and-paul). – Roman Nikitchenko Dec 19 '14 at 11:12
  • Sorry, not got time (not yet the vacation for me) – The Archetypal Paul Dec 19 '14 at 12:32
  • In my case I know my RDD is small, since it's already reduced by key. I want a vector to update one accumulator instead of updating multiple accumulators. – Noam Shaish Dec 19 '14 at 18:43