Spark mllib shuffling the data

Question

Does the Spark MLlib package shuffle the data? I have been using randomSplit on the data; however, looking at the splits, the records appear to keep the same order.

Is there a way to shuffle data before splitting it?

_mllib package shuffle the data_ - as stated by @eliasah, it doesn't. It simply takes a random sample by traversing each partition. _Is there a way to shuffle data before splitting it?_ - it depends on the context. You can always repartition or sort by a random value, but that a) is expensive, b) requires some effort to avoid caching if you want a different result each time, and c) makes it harder to get a reproducible sample if you need one. – zero323 Jan 22 '16 at 18:18
Between the two comments above, I think we have an answer. What do you suggest, @zero323? – eliasah Jan 23 '16 at 10:34
@eliasah If you feel like answering, don't mind me. I'll be happy to upvote if you extract this into something useful :) – zero323 Jan 23 '16 at 10:51

score 2 · Accepted Answer · edited Jan 23 '16 at 12:02

I think you are confusing actual data shuffling with the random seed used when splitting. If you set the split seed to a constant, say 11L, you will always get the same splits.

As stated by @zero323, MLlib simply takes a random sample by traversing each partition.
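To make the two points above concrete, here is a minimal sketch in plain Python. It is not Spark code and not Spark's implementation, only a simplified model that assumes one random draw per record in a single pass. The function name `random_split` and the toy data are hypothetical; the seed 11 mirrors the `11L` mentioned above.

```python
import random

def random_split(records, weights, seed):
    """Simplified model of a seeded random split: one draw per record."""
    total = float(sum(weights))
    # Cumulative upper bounds of each split's slice of [0, 1).
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for rec in records:  # single pass, in the original order
        u = rng.random()
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(rec)
                break
    return splits

data = list(range(20))  # toy data, already sorted
train, test = random_split(data, [0.7, 0.3], seed=11)
train_again, test_again = random_split(data, [0.7, 0.3], seed=11)
```

Under this model each split keeps the relative order of the input, which matches what the question observed, and a constant seed reproduces exactly the same splits. Membership is still random per record, so no shuffle is needed for the split itself to be random.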

Is there a way to shuffle data before splitting it?

It depends on the context. You can always repartition or sort by a random value, but doing so:

a) is expensive,
b) requires some effort to avoid caching if you want a different result each time,
c) makes it harder to get a reproducible sample if you need one.
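The "sort by a random value" idea can be sketched in plain Python as follows. This is an illustration under assumptions, not Spark code: the helper `shuffle_then_split` and its parameters are hypothetical, and a real Spark job would pay for a distributed sort rather than an in-memory one.

```python
import random

def shuffle_then_split(records, train_fraction, seed=None):
    """Attach a random key to every record, sort by it, then cut once."""
    rng = random.Random(seed)  # seed=None gives a different order each run
    keyed = [(rng.random(), rec) for rec in records]
    keyed.sort(key=lambda pair: pair[0])  # the expensive step: a full sort
    shuffled = [rec for _, rec in keyed]
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(20))
train, test = shuffle_then_split(data, 0.75, seed=42)
```

The sketch shows the trade-off listed above: passing a fixed seed makes the shuffled order reproducible, while leaving it unset gives a different order each run, and either way the whole dataset is sorted before it is split.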

Thus my approach is to iterate over the split seed, producing one split per seed, which is the main principle behind cross-validation. This way you can pick the best seed according to the evaluation step you are performing, and you still have a reproducible sample. Note that this approach is quite expensive.
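One possible reading of iterating over the split seed, sketched in plain Python rather than Spark. Everything here is an assumption made for illustration: the helper names, the toy labels, and in particular the `imbalance` score, which stands in for whatever evaluation step you actually run.

```python
import random

def random_split(labels, train_fraction, seed):
    """Seeded per-record split (simplified model, not Spark's code)."""
    rng = random.Random(seed)
    train, test = [], []
    for label in labels:
        (train if rng.random() < train_fraction else test).append(label)
    return train, test

def share_of_ones(part):
    return sum(part) / len(part)

def imbalance(train, test):
    """Hypothetical score: gap between the share of 1-labels in each split."""
    if not train or not test:
        return float("inf")
    return abs(share_of_ones(train) - share_of_ones(test))

labels = [1] * 50 + [0] * 50  # toy data, sorted by label
candidate_seeds = range(10)
scores = {
    seed: imbalance(*random_split(labels, 0.7, seed))
    for seed in candidate_seeds
}
best_seed = min(scores, key=scores.get)
best_train, best_test = random_split(labels, 0.7, best_seed)
```

Because every split is a pure function of its seed, the chosen split can be rebuilt later from `best_seed` alone. The cost grows linearly with the number of seeds tried, since each one requires a full split plus an evaluation.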

I hope this helps.

edited Jan 23 '16 at 12:02

zero323

answered Jan 23 '16 at 11:37

eliasah

In the case that I am working on, the source data is sorted by label, i.e. all the 1 labels come first, followed by all the 0 labels. If I understand the way randomSplit works correctly, all the 0s will end up in the test set, and the training algorithm may see mostly 1 labels. Is that correct? – Meisam Emamjome Jan 25 '16 at 13:52
I think that what you might need is the following: http://stackoverflow.com/questions/32238727/scala-pick-proportion-from-each-user/32241887#32241887 – eliasah Jan 25 '16 at 13:59
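The linked question is about picking a proportion from each group. Applied to labels, that idea is a stratified split. Below is a minimal plain-Python sketch of it, not taken from the linked answer and not Spark code; the name `stratified_split`, the `label_of` accessor, and the toy records are hypothetical. In Spark, `sampleByKey` on a pair RDD is the usual starting point for this kind of per-key sampling, though its exact behavior should be checked against the documentation for your version.

```python
import random
from collections import defaultdict

def stratified_split(records, label_of, train_fraction, seed):
    """Split each label group separately so both sides keep the label mix."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[label_of(rec)].append(rec)

    train, test = [], []
    for label in sorted(by_label):  # fixed group order keeps runs reproducible
        group = by_label[label]
        rng.shuffle(group)  # shuffle within the group only
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data sorted by label: 60 records labeled 1, then 40 labeled 0.
records = [(i, 1) for i in range(60)] + [(i, 0) for i in range(60, 100)]
train, test = stratified_split(records, lambda r: r[1], 0.75, seed=11)
```

With these toy numbers, 75% of each label group goes to the training side, so both sides see both labels regardless of how the source data was ordered.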
