I have a single column dataframe newDf as below:

+------------+
|       value|
+------------+
|5TEJU62N58Z4|
|000000000000|
|1J4GW48SX4C3|
|1J4GW68S2XC7|
|1J4GK48K04W1|
+------------+

It has 486 rows. I want to do stratified sampling on this dataframe. For that, I first need to create a fraction map and then pass it as an argument to the sampleBy method. This is what I am trying:

val fractions = newDf.distinct.map(x => (x,0.8)).collect().toMap
val sampled_df = newDf.stat.sampleBy("value", fractions, 10L)

But it errors out saying this:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"

I also tried preparing the fractions like this:

val fractions = newDf.map(_._1).distinct.map(x => (x,0.8)).collectAsMap()

But that fails to compile with this error:

Error:(32, 33) value _1 is not a member of org.apache.spark.sql.Row
    val fractions = newDf.map(_._1).distinct.map(x => (x,0.8)).collectAsMap()

How can I build this fraction map so that I can pass it to the sampleBy method and do the sampling?

CodeHunter

1 Answer

How about simply the following? It needs import spark.implicits._ in scope so that as[String] can find its encoder.

newDf.distinct.as[String].collect.map((_, 0.8)).toMap
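
For context, here is a minimal end-to-end sketch of how that one-liner could feed sampleBy. The object name StratifiedSampleSketch, the helper uniformFractions, the local SparkSession, and the five-row stand-in for newDf are all assumptions made for illustration; only the as[String] conversion and the sampleBy call come from the posts above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StratifiedSampleSketch {

  // Hypothetical helper: give every distinct value of a single-column
  // string DataFrame the same sampling fraction.
  def uniformFractions(df: DataFrame, fraction: Double): Map[String, Double] = {
    // Brings the String encoder into scope so that as[String] compiles.
    import df.sparkSession.implicits._
    df.distinct.as[String].collect.map((_, fraction)).toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sampleBy-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the 486-row newDf from the question.
    val newDf = Seq(
      "5TEJU62N58Z4", "000000000000", "1J4GW48SX4C3",
      "1J4GW68S2XC7", "1J4GK48K04W1"
    ).toDF("value")

    val fractions = uniformFractions(newDf, 0.8)

    // Same call as in the question: column name, fraction map, seed.
    val sampledDf = newDf.stat.sampleBy("value", fractions, 10L)
    sampledDf.show()

    spark.stop()
  }
}
```

Converting to Dataset[String] before building the tuples sidesteps both errors in the question: the map keys are plain strings rather than Row objects, so no Row encoder is needed and there is no _1 to look up on a Row. Note that sampleBy keeps each row of a stratum with the given probability, so the result holds roughly 80% of the rows rather than exactly 80%.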