How can I select balanced sampling for binary classification?

Question

Here is my code. It loads data from Hive and balances the samples:

// Load SubSet Data
val dataList = DataLoader.loadSubTrainTestData(hiveContext.sql(sampleDataHql))

// Split Data to Train and Test
val data = dataList.randomSplit(Array(0.7, 0.3), seed = 11L)

// Random balance train data
val sampleCount = data(0).map(rec => (rec.label, 1)).reduceByKey(_ + _)

val positiveSample = data(0).filter(_.label == 1).cache()
val positiveSize = positiveSample.count()

val negativeSample = data(0).filter(_.label == 0).cache()
val negativeSize = negativeSample.count()

// Build train data
val trainData = positiveSample ++
  negativeSample.sample(withReplacement = false, 1.0 * positiveSize.toFloat / negativeSize, System.nanoTime())

// Data size
val trainDataSize = positiveSize + negativeSize
val testDataSize = trainDataSize * 3.0 / 7.0

I also calculate trainDataSize and testDataSize in order to evaluate the model's confidence.

I'm not sure I understand what you mean by sample balance. I've never heard of such a thing. What are you trying to accomplish? What is your data? — eliasah, Jul 01 '16 at 05:28
That depends on the task at hand and the model you are trying to train. What you are saying isn't always true. — eliasah, Jul 01 '16 at 06:34
Yeah, you are right, and I need to balance the positive and negative samples. — Dylan Wang, Jul 01 '16 at 06:37
OK, but what is your data like? Is it an RDD or a DataFrame? What format? — eliasah, Jul 01 '16 at 06:44
Have you taken a look at this? http://stackoverflow.com/questions/32238727/stratified-sampling-in-spark/32241887#32241887 — eliasah, Jul 01 '16 at 06:47
The 'dataList' in the code is an RDD[LabeledPoint], and rec.label = 1/0 means the sample is positive or negative. I want to find a smarter way to optimize the code. — Dylan Wang, Jul 01 '16 at 06:48
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/116163/discussion-between-eliasah-and-dylan-wang). — eliasah, Jul 01 '16 at 07:00

score 2 · Accepted Answer · edited Jul 01 '16 at 11:15

OK, I haven't tested this code, but it should go like this:

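import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ???

// Fraction of each label's records to keep
val fractions: Map[Double, Double] = Map(0.0 -> 0.5, 1.0 -> 0.5)
val sampledData: RDD[LabeledPoint] = data
  .keyBy(_.label)
  .sampleByKeyExact(false, fractions)  // Optionally with seed
  .values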

You can key your LabeledPoints by label to get a pair RDD, then apply sampleByKeyExact with the per-label fractions you wish to use.
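Note that each fraction is the share of that label's records to keep, so equal fractions preserve the original class ratio. If the goal is roughly equal class sizes, one option is to derive the fractions from the label counts. The following is a minimal, untested sketch under that assumption; `BalancedSampling`, `balancingFractions`, and `balance` are hypothetical names used only for illustration, and the approach simply downsamples every class to the size of the rarest one.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object BalancedSampling {
  // Hypothetical helper: keep every record of the rarest label and
  // downsample each other label to about the same number of records.
  def balancingFractions(counts: scala.collection.Map[Double, Long]): Map[Double, Double] = {
    val nonEmpty = counts.filter { case (_, n) => n > 0 }
    if (nonEmpty.isEmpty) Map.empty[Double, Double]
    else {
      val minority = nonEmpty.values.min.toDouble
      // Each fraction is at most 1.0, as sampling without replacement requires.
      nonEmpty.map { case (label, n) => label -> minority / n }.toMap
    }
  }

  // Assumes the labels are few enough for countByKey to return to the driver.
  def balance(data: RDD[LabeledPoint], seed: Long): RDD[LabeledPoint] = {
    val keyed = data.keyBy(_.label)
    val fractions = balancingFractions(keyed.countByKey())
    keyed.sampleByKeyExact(false, fractions, seed).values
  }
}
```

With 900 negative and 100 positive records, this keeps all positives and about one ninth of the negatives. Passing a fixed seed makes the sample reproducible, unlike seeding with System.nanoTime().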
