I want to do stratified sampling on my DataFrame in Scala. My DataFrame has only one column, and I want to build a fraction map for it. I can do this in PySpark, but it gives me an error in Scala. Here is what I tried in Scala:
import org.apache.spark.sql.functions.{lit}
val fractions = pqdf.select("vin").distinct().withColumn("fraction", lit(0.001)).rdd.collect().toMap
It errors out saying:
Error:(25, 100) Cannot prove that org.apache.spark.sql.Row <:< (T, U).
val fractions = pqdf.select("vin").distinct().withColumn("fraction", lit(0.001)).rdd.collect().toMap
How do I resolve it? I want to pass the fraction map created above as one of the parameters to the .sampleBy method:
val sampled_df = pqdf.stat.sampleBy("vin", fractions, 10L)
This is what I tried in PySpark, which works:
from pyspark.sql.functions import lit
fractions = df.select("VIN").distinct().withColumn("fraction", lit(0.001)).rdd.collectAsMap()
# fractions
sampled_df = df.stat.sampleBy("VIN", fractions, 10)
I am not sure how to achieve the same thing in Scala.
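For context, the compile error comes from toMap, which needs a collection of pairs (K, V); collect() on a DataFrame returns Array[Row], and a Row is not a tuple. One possible way around it, sketched here under assumptions rather than as a definitive fix, is to convert each Row into a (String, Double) pair before calling toMap. The object name FractionMapSketch, the helper fractionMap, and the sample data are hypothetical, and the sketch assumes "vin" is a non-null string column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FractionMapSketch {
  // Hypothetical helper: builds a Map from each distinct value of `column`
  // to the same sampling fraction. Assumes `column` holds non-null strings.
  def fractionMap(df: DataFrame, column: String, fraction: Double): Map[String, Double] =
    df.select(column)
      .distinct()
      .collect()                                  // Array[Row]
      .map(row => row.getString(0) -> fraction)   // Array[(String, Double)]
      .toMap                                      // pairs, so toMap now compiles

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fraction-map-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for pqdf: a single string column named "vin".
    val pqdf = Seq("A", "A", "B", "C", "C").toDF("vin")

    val fractions = fractionMap(pqdf, "vin", 0.001)

    // sampleBy takes an immutable Map[T, Double] and a Long seed.
    val sampledDf = pqdf.stat.sampleBy("vin", fractions, 10L)
    sampledDf.show()

    spark.stop()
  }
}
```

Staying closer to the original one-liner, mapping the RDD to pairs before collecting (for example with r => (r.getString(0), r.getDouble(1)) followed by collectAsMap().toMap) should work along the same lines; the trailing toMap is there because sampleBy expects an immutable Map.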