
Does the Spark MLlib package shuffle the data? I have been using randomSplit on the data; however, looking at the splits, the rows appear to keep their original order.

Is there a way to shuffle data before splitting it?

zero323
Meisam Emamjome
  • _mllib package shuffle the data_ - as stated by @eliasah, it doesn't. It simply takes a random sample by traversing each partition. _Is there a way to shuffle data before splitting it?_ - it depends on the context. You can always repartition or sort by a random value, but it a) is expensive, b) requires some effort to avoid caching if you want a different result each time, and c) makes it harder to get a reproducible sample if you need one. – zero323 Jan 22 '16 at 18:18
  • Between the two comments above, I think we have an answer. What do you suggest, @zero323? – eliasah Jan 23 '16 at 10:34
  • @eliasah If you feel like answering, don't mind me. I'll be happy to upvote if you extract this into something useful :) – zero323 Jan 23 '16 at 10:51
  • OK, thanks buddy @zero323! – eliasah Jan 23 '16 at 10:52

1 Answer


I think you are confusing actual data shuffling with the random seed used when splitting. If you set your split seed to a constant, for example 11L, you'll always get the same splits.

And, as stated by @zero323, MLlib simply takes a random sample by traversing each partition.
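To make the point about sampling versus shuffling concrete, here is a minimal plain-Python sketch. It is not Spark code: `random_split` is a hypothetical stand-in that models the behaviour described above (one pass over the rows, one uniform draw per row, each row routed to the split whose range contains the draw), and the label-sorted input is an assumption chosen for illustration.

```python
import random


def random_split(rows, weights, seed):
    """Simplified model of a sampling-based split (hypothetical helper, not Spark's API).

    Each row gets one uniform draw and goes to the split whose
    cumulative range contains that draw. Rows are never reordered.
    """
    total = float(sum(weights))
    # Cumulative boundaries, e.g. [0.0, 0.7, 1.0] for weights [0.7, 0.3].
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i in range(len(weights)):
            if bounds[i] <= x < bounds[i + 1]:
                splits[i].append(row)
                break
    return splits


# Assumed input, sorted by label: 500 rows labelled 1, then 500 labelled 0.
rows = [(i, 1) for i in range(500)] + [(i, 0) for i in range(500, 1000)]

train, test = random_split(rows, [0.7, 0.3], seed=11)
train_again, test_again = random_split(rows, [0.7, 0.3], seed=11)
train_other, test_other = random_split(rows, [0.7, 0.3], seed=12)
```

Under this model, each split keeps the input order (which is why the splits look unshuffled), a fixed seed reproduces the same splits, and a different seed selects different rows. Because membership is decided row by row, both labels can land in both splits even when the input is sorted by label.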

_Is there a way to shuffle data before splitting it?_

It depends on the context. You can always repartition or sort by a random value, but this:

  1. Is expensive
  2. Requires some effort to avoid caching if you want a different result each time
  3. Makes it harder to get a reproducible sample if you need one
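The "sort by random value" idea can be sketched in plain Python as follows. This is only an illustration of the technique on a local list: `shuffle_by_random_key` is a hypothetical helper, and in Spark the comparable step would be a sort on a random column, which involves a full shuffle of the data and is the source of the cost mentioned above.

```python
import random


def shuffle_by_random_key(rows, seed=None):
    """Shuffle by attaching a random key to each row and sorting on it.

    Hypothetical helper for illustration. With seed=None the order
    differs between runs; a fixed seed makes the order reproducible.
    """
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])  # the expensive step at scale
    return [row for _, row in keyed]


rows = list(range(20))
shuffled = shuffle_by_random_key(rows, seed=11)
shuffled_again = shuffle_by_random_key(rows, seed=11)
```

Fixing the seed addresses the reproducibility concern in this local sketch; in a distributed setting the partitioning and any caching also affect whether a rerun yields the same order.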

My approach is therefore to iterate over the split seed, which is the main principle behind cross-validation. This way you can pick the best seed according to the evaluation step you are performing, and you still get a reproducible sample, although this approach is quite expensive.
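A rough sketch of iterating over split seeds, again in plain Python rather than Spark. Everything here is an assumption made for illustration: `random_split` models a sampling-based split, `evaluate` is a hypothetical stand-in for whatever evaluation step you actually run (here a majority-label baseline), and the data is synthetic.

```python
import random
from collections import Counter


def random_split(rows, train_fraction, seed):
    """Sampling-based two-way split (hypothetical helper, not Spark's API)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


def evaluate(train, test):
    """Hypothetical evaluation step: accuracy of predicting the training majority label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(1 for _, label in test if label == majority) / len(test)


# Synthetic data: 600 rows labelled 1, then 400 rows labelled 0.
rows = [(i, 1) for i in range(600)] + [(i, 0) for i in range(600, 1000)]

scores = {}
for seed in range(5):
    train, test = random_split(rows, 0.7, seed)
    scores[seed] = evaluate(train, test)

best_seed = max(scores, key=scores.get)
mean_score = sum(scores.values()) / len(scores)
```

Each seed costs one full split plus one evaluation, which is where the expense comes from. Keeping `best_seed` lets you regenerate the same split later, and `mean_score` gives a sense of how much the result varies with the split.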

I hope this helps.

zero323
eliasah
  • In the case I am working on, the source data is sorted by label, i.e. it has all the 1 labels followed by all the 0 labels. If I understand correctly how randomSplit works, all the 0s will end up in the test set. Furthermore, the training algorithm may see mostly 1 labels. Is that correct? – Meisam Emamjome Jan 25 '16 at 13:52
  • I think what you might need is the following: http://stackoverflow.com/questions/32238727/scala-pick-proportion-from-each-user/32241887#32241887 – eliasah Jan 25 '16 at 13:59