I have an RDD[(String, Array[String])] and I need to replicate the data inside it to increase its size.

I've read here https://stackoverflow.com/a/41787801/9759150 that with replacement, the same element can appear in the sample more than once.

For example: if RDD.count() is, let's say, 35 elements, and I need to generate from it an RDD with 200 elements, how can I do this?

I saw that sample is applied like this:

val sampledRDD = rdd.sample(true, fraction, seed)

I do not know how to choose the fraction parameter for my problem.

Thank you!

diens

2 Answers

You can see this answer for more information about the meaning of fraction in rdd.sample(). The short story is: without replacement, it is the probability that each element is chosen; with replacement, it is the expected number of times each element is chosen. Either way the size of the sampled RDD is random, so it is not guaranteed to equal fraction * original size exactly.
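To make that concrete, here is a minimal sketch, assuming a local SparkContext and a toy RDD of 35 integers (the object name, helper name, seeds, and data are illustrative and not from the question). It samples with the same fraction under several seeds and prints the resulting sizes, which typically differ from seed to seed while staying near 200.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleSizeVaries {
  // Hypothetical helper: size of the sampled RDD for each seed.
  def sampledCounts(sc: SparkContext, seeds: Seq[Long]): Seq[Long] = {
    val rdd = sc.parallelize(1 to 35)      // toy stand-in for the real RDD
    val fraction = 200.0 / rdd.count()     // expected number of draws per element
    seeds.map(seed => rdd.sample(true, fraction, seed).count())
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sample-size-varies").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(sampledCounts(sc, Seq(1L, 2L, 3L, 4L, 5L)).mkString(", "))
    finally sc.stop()
  }
}
```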

I would approach this in the opposite direction:

  1. First, generate an RDD that is simply the original RDD, repeated several times
  2. Now, sample out of that RDD down to the size you want.

Something like:

val rdds = (1 to 10).map(_ => originalRdd)
val bigRdd = sc.union(rdds)
val sampledRdd = bigRdd.sample(true, fraction, seed)

and set fraction such that the final RDD is the size you want:

val fraction = numResultsIWant.toDouble / (10 * originalRdd.count())  // .toDouble avoids integer division

The 10 is there because that was the number of copies of the RDD we created.
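Putting the two steps together, a minimal sketch might look like the following. The object name, the growBySampling helper, and its parameters (target, copies) are hypothetical names introduced for illustration; the toy RDD of 35 integers stands in for the real data, and a local SparkContext is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object UnionThenSample {
  // Hypothetical helper: repeat `original` `copies` times, then sample down
  // so that the expected size of the result is `target`.
  def growBySampling[T: ClassTag](sc: SparkContext, original: RDD[T],
                                  target: Long, copies: Int, seed: Long): RDD[T] = {
    val bigRdd = sc.union(Seq.fill(copies)(original))
    // Divide by the size of the repeated RDD; toDouble avoids integer division.
    val fraction = target.toDouble / (copies.toLong * original.count())
    bigRdd.sample(true, fraction, seed)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-then-sample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val original = sc.parallelize(1 to 35)
      val sampled = growBySampling(sc, original, 200L, 10, 42L)
      println(sampled.count())   // close to 200, but not guaranteed to be exactly 200
    } finally sc.stop()
  }
}
```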

Metropolis

I was doing some tests and I figured out that .sample() is able to do what I wanted! The key is to keep withReplacement set to true (as I said in the question). The seed can be any number, but fraction should be:

val fraction = num_new.toDouble / rdd.count()  // following my example: num_new is 200, and rdd.count() is 35

val sampledRDD = rdd.sample(true, fraction, seed)

In this case, fraction = 5.71428571428571, which means each element of rdd is expected to appear about fraction times in sampledRDD. The result therefore has roughly 200 elements, although the exact count can vary.
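If the result must have exactly 200 elements, one alternative that is not part of either answer is takeSample, which returns a fixed-size array on the driver. The sketch below assumes the sampled data fits in driver memory and that a local SparkContext is available; the object and helper names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object ExactSizeSample {
  // Hypothetical helper: draw exactly `n` elements with replacement,
  // then turn the driver-side array back into an RDD.
  def resampleExactly[T: ClassTag](sc: SparkContext, rdd: RDD[T],
                                   n: Int, seed: Long): RDD[T] = {
    val drawn: Array[T] = rdd.takeSample(true, n, seed)  // collected on the driver
    sc.parallelize(drawn.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("exact-size-sample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 35)   // toy stand-in for the real RDD
      println(resampleExactly(sc, rdd, 200, 42L).count())
    } finally sc.stop()
  }
}
```

Because takeSample collects its result on the driver, this only suits targets small enough to hold in memory; for large targets, sample with a computed fraction remains the distributed option.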

diens