I'm using Spark (1.5.2) DataFrames and trying to get a stratified dataset. My data has been prepped for binary classification, and the class column has only two values, 1 and 0.
import scala.util.Random

// 70/30 split into training and test sets
val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))
// Sampling fraction for each value of "class"
val fractions: Map[Int, Double] = Map(1 -> 0.5, 0 -> 0.5)
val trainingData3 = trainingData.stat.sampleBy("class", fractions, new Random().nextLong)
// Count the rows of each class in the sampled training set
println("Training True Class = " + trainingData3.where("class=1").count())
println("Training False Class = " + trainingData3.where("class=0").count())
On the console, the output shows a ratio of class 1 to class 0 that is far from what I expected:
Training True Class = 799845
Training False Class = 32797260
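For reference, a minimal sketch of the arithmetic involved, assuming (as the Spark API docs describe) that sampleBy treats each fraction as an independent sampling rate for its own stratum rather than as a target share of the result. Under that assumption, equal fractions keep the original class ratio, and a roughly balanced sample needs fractions derived from the class counts. The object and the helper balancedFractions are hypothetical names used only for illustration, not part of Spark, and the counts are the ones printed above.

```scala
// Hypothetical helper (not a Spark API): derive per-class sampling fractions
// that downsample every class to roughly the size of the rarest class.
object BalancedFractions {
  def balancedFractions(counts: Map[Int, Long]): Map[Int, Double] = {
    // Size of the smallest class; its own fraction becomes 1.0
    val smallest = counts.values.min.toDouble
    counts.map { case (label, n) => label -> smallest / n }
  }

  def main(args: Array[String]): Unit = {
    // Class counts taken from the console output in the question
    val counts = Map(1 -> 799845L, 0 -> 32797260L)
    val fractions = balancedFractions(counts)
    // Expected rows kept per class = class count * class fraction
    fractions.foreach { case (label, f) =>
      println(s"class=$label fraction=$f expected=${(counts(label) * f).round}")
    }
  }
}
```

A map built this way could be passed as the fractions argument of trainingData.stat.sampleBy("class", fractions, seed). The resulting counts would only be approximately equal, since each row is kept or dropped at random.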