
How does RDD sample work in Spark? What do its different parameters do, i.e. sample(withReplacement, fraction, seed)?

I could not find anything relevant on the web regarding the 'withReplacement' and 'seed' parameters. Please explain with an example.

SPram
    Possible duplicate of [How do simple random sampling and dataframe SAMPLE function work in Apache Spark (Scala)?](http://stackoverflow.com/questions/32229941/how-do-simple-random-sampling-and-dataframe-sample-function-work-in-apache-spark) – user7337271 Jan 23 '17 at 12:29

1 Answer


fraction and seed are fairly easy to guess. fraction is the fraction of elements you want in your sample (e.g. a fraction of 0.5 gives you a sample containing roughly half of the elements of the initial RDD; this is the expected size, so the exact count is not guaranteed). seed is the random number generator seed. This matters because you might want to hard-code the same seed in your tests so that you always get the same results, while in production code you would replace it with the current time in milliseconds or a random number from a good entropy source.
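To make the role of fraction and seed concrete, here is a minimal sketch in plain Python (standard library only, not Spark). It assumes that sampling without replacement keeps each element independently with probability fraction; the helper name bernoulli_sample is hypothetical and only illustrates the idea behind rdd.sample(False, fraction, seed).

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Hypothetical helper: keep each item independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the random draws repeatable
    return [x for x in items if rng.random() < fraction]

data = list(range(100))

first = bernoulli_sample(data, 0.5, seed=42)
second = bernoulli_sample(data, 0.5, seed=42)  # same seed as `first`
other = bernoulli_sample(data, 0.5, seed=7)    # different seed

print(first == second)         # True: same seed, same sample
print(len(first), len(other))  # both near 50, but not necessarily exactly 50
```

The same seed reproduces the same sample, which is what makes a hard-coded seed useful in tests, and the sample size only approximates fraction times the input size.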

Sampling with replacement is a Google search away, e.g. https://www.ma.utexas.edu/users/parker/sampling/repl.htm. In short, if you sample with replacement, the same element can appear in the sample more than once; without replacement, it can appear at most once. So if your RDD contains [Bob, Alice, Carol], your "with replacement" sample can be [Alice, Alice], but a "without replacement" sample can't contain duplicates like that.
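The difference can be sketched with Python's standard random module (again, not Spark itself). Note one simplification: these calls draw a fixed number of elements, whereas Spark's sample takes a fraction, so this only illustrates the duplicate behaviour.

```python
import random

names = ["Bob", "Alice", "Carol"]
rng = random.Random(0)  # seeded so the run is repeatable

# With replacement: each draw is made from the full list, so repeats are possible.
with_replacement = rng.choices(names, k=2)

# Without replacement: a drawn element is not available again, so no repeats.
without_replacement = rng.sample(names, k=2)

print(with_replacement)     # may contain the same name twice
print(without_replacement)  # always two distinct names
```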

MK.