I have a single-column DataFrame, newDf, as shown below:
+------------+
|       value|
+------------+
|5TEJU62N58Z4|
|000000000000|
|1J4GW48SX4C3|
|1J4GW68S2XC7|
|1J4GK48K04W1|
+------------+
It has 486 rows. I want to do stratified sampling on this DataFrame. For that, I first need to create a fraction map and then pass it as an argument to the sampleBy method. This is what I am trying:
val fractions = newDf.distinct.map(x => (x,0.8)).collect().toMap
val sampled_df = newDf.stat.sampleBy("value", fractions, 10L)
But it errors out saying this:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
I also tried preparing the fraction map like this:
val fractions = newDf.map(_._1).distinct.map(x => (x,0.8)).collectAsMap()
But that fails to compile with this error:
Error:(32, 33) value _1 is not a member of org.apache.spark.sql.Row
val fractions = newDf.map(_._1).distinct.map(x => (x,0.8)).collectAsMap()
How can I prepare this fraction map so that I can pass it to the sampleBy method and do the sampling?
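
For reference, one possible way to build the map is sketched below. The first attempt fails because mapping a DataFrame to a (Row, Double) tuple needs an Encoder for Row inside the tuple, and the second fails because Row has no _1 member. Collecting the distinct rows to the driver first and then extracting the string from each Row on the local array avoids both problems. This is only a sketch under assumptions: the column is assumed to be a non-null string column, the helper name uniformFractions and the object name are hypothetical, and the local SparkSession and the five sample values stand in for the real 486-row newDf.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FractionMapSketch {

  // Hypothetical helper: assigns the same sampling fraction to every
  // distinct value of the given column (assumed to be a non-null string column).
  def uniformFractions(df: DataFrame, colName: String, fraction: Double): Map[String, Double] =
    df.select(colName)
      .distinct()
      .collect()                                    // Array[Row] on the driver, no Encoder needed
      .map(row => row.getString(0) -> fraction)     // plain Scala map over a local array
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sampleBy-sketch")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for newDf, using the five values shown in the question.
    val newDf = Seq(
      "5TEJU62N58Z4", "000000000000", "1J4GW48SX4C3",
      "1J4GW68S2XC7", "1J4GK48K04W1"
    ).toDF("value")

    val fractions = uniformFractions(newDf, "value", 0.8)

    // sampleBy keeps each row with the probability given for its stratum.
    val sampledDf = newDf.stat.sampleBy("value", fractions, 10L)
    sampledDf.show()

    spark.stop()
  }
}
```

Collecting the distinct keys is reasonable here because the DataFrame is small. Note that sampleBy samples each row independently, so the result is only approximately 80% of each stratum, and a stratum with very few rows may come back empty.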