
Here is my code. It loads data from Hive and balances the samples:

// Load SubSet Data
val dataList = DataLoader.loadSubTrainTestData(hiveContext.sql(sampleDataHql))

// Split Data to Train and Test
val data = dataList.randomSplit(Array(0.7, 0.3), seed = 11L)

// Random balance train data
val sampleCount = data(0).map(rec => (rec.label, 1)).reduceByKey(_ + _)

val positiveSample = data(0).filter(_.label == 1).cache()
val positiveSize = positiveSample.count()

val negativeSample = data(0).filter(_.label == 0).cache()
val negativeSize = negativeSample.count()

// Build train data
val trainData = positiveSample ++
  negativeSample.sample(withReplacement = false, 1.0 * positiveSize.toFloat / negativeSize, System.nanoTime())

// Data size
val trainDataSize = positiveSize + negativeSize
val testDataSize = trainDataSize * 3.0 / 7.0

I also calculate trainDataSize and testDataSize in order to evaluate the model's confidence.

eliasah
Dylan Wang
  • I'm not sure I understand what you mean by sample balance. I've never heard of such a thing. What are you trying to accomplish? What is your data? – eliasah Jul 01 '16 at 05:28
  • Unbalanced training data will affect the classification model. – Dylan Wang Jul 01 '16 at 06:33
  • That depends on the task at hand and the model you are trying to train. What you are saying isn't always true. – eliasah Jul 01 '16 at 06:34
  • Yeah, you are right, and I need to balance the positive and negative samples. – Dylan Wang Jul 01 '16 at 06:37
  • OK, but what is your data like? Is it an RDD or a DataFrame? What format? – eliasah Jul 01 '16 at 06:44
  • Have you taken a look at this? http://stackoverflow.com/questions/32238727/stratified-sampling-in-spark/32241887#32241887 – eliasah Jul 01 '16 at 06:47
  • The 'dataList' in the code is an RDD[LabeledPoint], and rec.label = 1/0 means the sample is positive or negative. I want to find a smarter way to optimize the code. – Dylan Wang Jul 01 '16 at 06:48
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/116163/discussion-between-eliasah-and-dylan-wang). – eliasah Jul 01 '16 at 07:00

1 Answer

OK, I haven't tested this code, but it should go like this:

val data: RDD[LabeledPoint] = ???

val fractions: Map[Double, Double] = Map(0.0 -> 0.5, 1.0 -> 0.5)
val sampledData: RDD[LabeledPoint] = data
  .keyBy(_.label)
  .sampleByKeyExact(false, fractions)  // Optionally with seed
  .values

You can key your RDD[LabeledPoint] by label to get a pair RDD, then apply sampleByKeyExact with the fractions you want to use.
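Note that equal fractions such as 0.5 for both labels keep the original class ratio, so they do not balance anything by themselves. To balance by downsampling, as the question does, the fractions would have to depend on the class counts. Here is a minimal sketch of that arithmetic in plain Scala. The names BalanceFractions and balancedFractions are hypothetical helpers, not part of Spark, and the counts in the comment are made up for illustration:

```scala
// Hypothetical helper (not a Spark API): derive per-label sampling fractions
// that downsample every class to the size of the rarest class.
object BalanceFractions {
  def balancedFractions(counts: Map[Double, Long]): Map[Double, Double] = {
    require(counts.nonEmpty, "need at least one class")
    require(counts.values.forall(_ > 0), "every class needs at least one record")
    val smallest = counts.values.min
    // The rarest class gets fraction 1.0; larger classes get smallest / n.
    counts.map { case (label, n) => label -> smallest.toDouble / n }
  }
}

// Illustrative counts only: with 100 positives and 400 negatives this keeps
// all positives and a quarter of the negatives.
// BalanceFractions.balancedFractions(Map(1.0 -> 100L, 0.0 -> 400L))
```

With Spark, the counts could presumably come from data.map(_.label).countByValue().toMap, and the resulting map could be passed as the fractions argument of sampleByKeyExact above. I have not run that part against a cluster, so treat it as a starting point.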

zero323
eliasah