Does the Spark MLlib package shuffle the data? I have been using randomSplit on the data; however, looking at the splits, the rows appear to be in the same order as before.
Is there a way to shuffle the data before splitting it?
I think you are confusing actual data shuffling with the random seed used when splitting. If you set the split seed to a constant, say 11L, you'll always get the same splits.
And as stated by @zero323, MLlib simply takes a random sample by traversing each partition, so the rows that land in a split keep their original relative order.
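To make the point concrete, here is a minimal sketch in plain Python of a seeded split done in a single pass. It is only a toy model of the idea and not Spark's actual implementation; the function name `random_split` and the sample data are made up for illustration. It shows that the same seed gives the same splits, and that rows keep their relative order inside each split.

```python
import random

def random_split(rows, weights, seed):
    """Toy model of a seeded random split: one pass, order preserved.

    Hypothetical helper for illustration only, not a Spark API.
    """
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.0, 0.8, 1.0] for weights [0.8, 0.2].
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # one draw per row, in traversal order
        for i in range(len(weights)):
            if bounds[i] <= x < bounds[i + 1]:
                splits[i].append(row)
                break
    return splits

rows = list(range(20))
train_a, test_a = random_split(rows, [0.8, 0.2], seed=11)
train_b, test_b = random_split(rows, [0.8, 0.2], seed=11)

print(train_a == train_b and test_a == test_b)  # same seed, same splits
print(train_a == sorted(train_a))               # relative order is kept
```

Nothing here reorders the rows: the seed only decides which split each row goes to.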
Is there a way to shuffle data before splitting it?
It depends on the context. You can always repartition or sort by a random value but it is
So my approach is to iterate over the split seed, which is the main principle behind cross-validation. For each seed you split the data and run your evaluation step, then keep the seed that scores best. The resulting sample is reproducible, but this approach is quite expensive because every candidate seed costs a full split and evaluation.
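A rough sketch of that loop, again in plain Python rather than Spark, could look like the following. The `split` and `evaluate` helpers, the toy data, and the choice of five seeds are all assumptions for illustration; in practice `evaluate` would be whatever model fitting and metric you actually use.

```python
import random

def split(rows, train_fraction, seed):
    """Seeded one-pass split (hypothetical helper, not a Spark API)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

def evaluate(train, test):
    """Toy score: negative mean squared error of a mean predictor."""
    if not train or not test:
        return float("-inf")
    mean = sum(train) / len(train)
    return -sum((y - mean) ** 2 for y in test) / len(test)

rows = [float(i % 7) for i in range(50)]

# One full split and evaluation per candidate seed: this is the expensive part.
scores = {seed: evaluate(*split(rows, 0.8, seed)) for seed in range(5)}
best_seed = max(scores, key=scores.get)

# Re-running with the chosen seed reproduces the same sample.
best_train, best_test = split(rows, 0.8, best_seed)
```

Because the split is fully determined by the seed, storing `best_seed` is enough to get the same train and test sets back later.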
I hope this helps.