I'm using Spark (1.5.2) DataFrames and trying to get a stratified sample. My data has been prepared for binary classification, and the class column has only two values, 1 and 0.

import scala.util.Random

// 70/30 train/test split
val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))
// Intended: a 50/50 class balance in the sample
val fractions: Map[Int, Double] = Map(1 -> 0.5, 0 -> 0.5)

val trainingData3 = trainingData.stat.sampleBy("class", fractions, new Random().nextLong)

println("Training True Class = " + trainingData3.where("class=1").count())
println("Training False Class = " + trainingData3.where("class=0").count())

The console output shows a ratio of class 1 to class 0 that is nowhere near 50/50:

Training True Class = 799845
Training False Class = 32797260
Peter

1 Answer

The fractions passed to sampleBy for DataFrames, as with sampleByKey and sampleByKeyExact for RDDs, are not the proportions you want in the result set. Rather, each one is the fraction of that class's rows you wish to keep from the original dataset.

To get a 50/50 split, compare the counts of class 1 and class 0 in the full dataset, work out their ratio, and use it to choose your fractions.

For example, if 98% of records are class 0 and 2% are class 1 and you want a 50/50 split, you might keep 100% of class 1 and about 2% of class 0 (more precisely, 2/98 ≈ 2.04%).

val fractions: Map[Int, Double] = Map(1 -> 1.0, 0 -> 0.02)
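
Rather than hard-coding the fractions, you could derive them from the class counts. The following is only a minimal sketch: the object and function names (BalancedFractions, balancedFractions) are made up for illustration, and it assumes the goal is to downsample every class to the size of the rarest one.

```scala
object BalancedFractions {
  // Hypothetical helper (not part of Spark): given the row count of each class,
  // return per-class keep-fractions that shrink every class to roughly the
  // size of the rarest class.
  def balancedFractions(counts: Map[Int, Long]): Map[Int, Double] = {
    require(counts.nonEmpty && counts.values.forall(_ > 0),
      "every class needs at least one row")
    val minority = counts.values.min.toDouble
    counts.map { case (label, n) => label -> minority / n }
  }
}

// Possible use with the DataFrame from the question. This part is an
// assumption: it presumes "class" is stored as an integer column.
//
//   val counts = trainingData.groupBy("class").count().collect()
//     .map(r => r.getInt(0) -> r.getLong(1)).toMap
//   val fractions = BalancedFractions.balancedFractions(counts)
//   val balanced = trainingData.stat.sampleBy("class", fractions, 42L)
```

With the counts printed in the question (799845 and 32797260), this would keep all of class 1 and about 2.4% of class 0. Since sampleBy samples each row independently, the resulting class sizes will be close to equal rather than exactly equal.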
Peter