
Assume I have an RDD containing (Int, Int) tuples. I wish to turn it into a Vector where the first Int in the tuple is the index and the second is the value.

Any idea how I can do that?

I updated my question and added my solution to clarify: my RDD is already reduced by key, and the number of keys is known. I want a vector in order to update a single accumulator instead of multiple accumulators.

Therefore my final solution was:

reducedStream.foreachRDD(rdd => rdd.foreach { case (x: Int, y: Int) =>
  // Build a one-hot contribution and add it to the single accumulator.
  // (foreach is an action; collect with a partial function is a lazy
  // transformation, so its side effects would never run.)
  val v = Array(0, 0, 0, 0)
  v(x) = y
  accumulator += new Vector(v)
})

This uses the Vector class from the accumulator example in the Spark documentation.
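The effect can be sketched locally in plain Scala (no Spark; the pair values below are made up): each (index, value) pair becomes a one-hot contribution, and summing the contributions yields the same vector a single accumulator would end up holding.

```scala
// Hypothetical reduced pairs: (index, value), indices unique, size known.
val pairs = List((0, 7), (2, 3), (3, 9))

// Each pair becomes a one-hot array, mirroring what each element
// contributes to the accumulator.
val contributions = pairs.map { case (x, y) =>
  val v = Array(0, 0, 0, 0)
  v(x) = y
  v
}

// Element-wise summation is what `accumulator +=` performs across tasks.
val total = contributions.reduce { (a, b) =>
  a.zip(b).map { case (p, q) => p + q }
}
// total.toVector == Vector(7, 0, 3, 9)
```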

Roman Nikitchenko
Noam Shaish

2 Answers

// numKeys is the known number of keys; pre-sizing keeps updated(k, v) in bounds
rdd.collectAsMap.foldLeft(Vector.fill(numKeys)(0)) { case (acc, (k, v)) => acc.updated(k, v) }

Turn the RDD into a Map. Then iterate over that, building a Vector as we go.

You could use just collect(), but if there are many repetitions of tuples with the same key, the result might not fit in memory.
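A local illustration in plain Scala (no Spark; the map below stands in for the result of collectAsMap, and the vector is pre-sized because updated cannot grow an empty Vector):

```scala
// Stand-in for rdd.collectAsMap on an RDD[(Int, Int)].
val collected = Map(0 -> 7, 2 -> 3, 3 -> 9)

// Pre-size the vector so updated(k, v) is always in bounds, then fold
// the map into it, placing each value at its key's index.
val size = 4
val vec = collected.foldLeft(Vector.fill(size)(0)) {
  case (acc, (k, v)) => acc.updated(k, v)
}
// vec == Vector(7, 0, 3, 9)
```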

The Archetypal Paul

One key thing: do you really need a Vector? A Map could be much more suitable.

  • If you really need a local Vector, you first have to use .collect() and then do the transformation into a Vector locally. Of course, you must have enough memory for this. The real problem, though, is finding a Vector that can be built efficiently from (index, value) pairs. As far as I know, Spark MLlib has the class org.apache.spark.mllib.linalg.Vectors, which can create a Vector from arrays of indices and values, and even from tuples. Under the hood it uses breeze.linalg, so it is probably the best starting point for you.

  • If you just need ordering, you can use .sortByKey(), since you already have an RDD[(K, V)]. This gives you an ordered stream, which does not strictly follow your intention but may suit you even better. You can then drop elements with the same key via .reduceByKey(), producing only the resulting elements.

  • Finally, if you really need a large vector, do .sortByKey() and then produce the real vector with a .flatMap() that maintains a counter, drops all but one element for the same index, and inserts the needed number of 'default' elements for missing indices.

Hope this is clear enough.
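The third option can be sketched locally in plain Scala (no Spark; the sample pairs and the target size of 5 are made up): after sorting, a flatMap with a counter drops duplicate indices and fills gaps with defaults.

```scala
// Sorted (index, value) pairs; index 2 appears twice and should be deduplicated.
val sorted = List((0, 7), (2, 3), (2, 5), (3, 9))
val size = 5

var next = 0 // counter: the next index the output vector still needs
val dense = sorted.flatMap { case (i, v) =>
  if (i < next) Nil // duplicate index: keep the first occurrence only
  else {
    val gap = List.fill(i - next)(0) // 'default' elements for missing indices
    next = i + 1
    gap :+ v
  }
} ++ List.fill(size - next)(0) // pad the tail with defaults
// dense == List(7, 0, 3, 9, 0)
```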

Roman Nikitchenko
  • These all seem more complex than the solution in my answer. In particular, sort + reduce or sort + flatmap are pretty inefficient, and collectAsMap avoids the issue of needing enough memory (all solutions need to assume the final Vector will fit in memory, of course) – The Archetypal Paul Dec 18 '14 at 22:25
  • OK, probably this is because of my specifics: I usually work with large RDDs, so your solution looks awful to me because of .updated(). I think @Noam has to re-think whether he really needs a Vector. – Roman Nikitchenko Dec 18 '14 at 22:29
  • Hmmm... for me 'fits in memory' is usually wrong suggestion ;-). – Roman Nikitchenko Dec 18 '14 at 22:30
  • Sure. But the OP wants a Vector, so it needs to fit :) – The Archetypal Paul Dec 18 '14 at 22:41
  • ^) The key word is 'needs': the flatMap solution gives a real structure with elements even in the correct places, even with low memory. OK, it cannot be randomly accessed. Too much here depends on whether the OP really needs a Vector. BTW, I must say your snippet was the first time I have seen a practical advantage from folding ) - never needed it before. – Roman Nikitchenko Dec 18 '14 at 22:46
  • folds can do anything (this is actually true: http://www.cs.nott.ac.uk/~gmh/fold.pdf) – The Archetypal Paul Dec 18 '14 at 22:48
  • http://stackoverflow.com/questions/25158780/difference-between-reduce-and-foldleft-fold-in-functional-programming-particula Ooops? – Roman Nikitchenko Dec 18 '14 at 22:49
  • Nope - because I'm folding on the map that results from collectAsMap. If I was foldLeft'ing on the RDD, then that would be bad (compared to reduce) – The Archetypal Paul Dec 18 '14 at 22:51
  • Exactly, I'm about 'everything' ;-). But thank you for this article. Pretty interesting. – Roman Nikitchenko Dec 18 '14 at 22:52
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/67317/discussion-between-roman-nikitchenko-and-paul). – Roman Nikitchenko Dec 19 '14 at 11:12
  • Sorry, not got time (not yet the vacation for me) – The Archetypal Paul Dec 19 '14 at 12:32
  • In my case I know my RDD is small, since it's already reduced by key. I want a vector to update one accumulator instead of updating multiple accumulators. – Noam Shaish Dec 19 '14 at 18:43