
I have a dataset in which the same id appears in multiple rows. I want to sample the rows per id, meaning I want to pick 15% of the records for each id. For example, my dataset looks like:

id  ip
1   x
1   y
1   z
2   x
2   y

For each id I want to pick 15% of the ips. What would be the best way to do this?

1 Answer


Utkarsh, what you are trying to do is called stratified sampling, and Spark has direct methods for it. Keyed sampling (`sampleByKey` on pair RDDs) is possible for this as well.
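The per-key idea can be sketched without Spark. This is a plain-Python illustration (not Spark code): group the rows by id, then keep a fixed fraction of each group. Taking the ceiling guarantees every id contributes at least one row, which Spark's probabilistic sampling does not.

```python
import math
import random
from collections import defaultdict

def sample_per_key(rows, key, fraction, seed=7):
    """Exact per-key sampling: shuffle each key's rows
    and keep ceil(fraction * group_size) of them."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = []
    for _, grp in groups.items():
        rng.shuffle(grp)
        out.extend(grp[:math.ceil(fraction * len(grp))])
    return out

rows = [{"id": 1, "ip": "x"}, {"id": 1, "ip": "y"}, {"id": 1, "ip": "z"},
        {"id": 2, "ip": "x"}, {"id": 2, "ip": "y"}]

# 15% per id: ceil(0.15 * 3) = 1 row for id 1, ceil(0.15 * 2) = 1 row for id 2
sample = sample_per_key(rows, "id", 0.15)
```

Note this exact approach shuffles each whole group, which on a distributed dataset is more expensive than Spark's single-pass Bernoulli sampling below.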

The DataFrame API also provides `sampleBy`, with these overloads:


sampleBy[T](col : _root_.scala.Predef.String, fractions : _root_.scala.Predef.Map[T, scala.Double], seed : scala.Long) : DataFrame
sampleBy[T](col : _root_.scala.Predef.String, fractions : java.util.Map[T, java.lang.Double], seed : scala.Long) : DataFrame
sampleBy[T](col : org.apache.spark.sql.Column, fractions : _root_.scala.Predef.Map[T, scala.Double], seed : scala.Long) : DataFrame
sampleBy[T](col : org.apache.spark.sql.Column, fractions : java.util.Map[T, java.lang.Double], seed : scala.Long) : DataFrame
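`sampleBy` takes a stratum column, a map from stratum value to sampling fraction, and a seed, and keeps each row with the probability assigned to its stratum. The logic can be sketched in plain Python (the function name and data here are illustrative, not the Spark API):

```python
import random

def sample_by(rows, key, fractions, seed=42):
    """Bernoulli per-stratum sampling, like Spark's sampleBy:
    keep each row with probability fractions[row[key]]
    (0.0 for strata missing from the map)."""
    rng = random.Random(seed)
    return [row for row in rows
            if rng.random() < fractions.get(row[key], 0.0)]

rows = [{"id": 1, "ip": "x"}, {"id": 1, "ip": "y"}, {"id": 1, "ip": "z"},
        {"id": 2, "ip": "x"}, {"id": 2, "ip": "y"}]

# 15% of the rows for every id
sample = sample_by(rows, "id", {1: 0.15, 2: 0.15})
```

Because the draw is independent per row, the result only approximates 15% per id; on small groups a given id may contribute zero rows for some seeds.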

For examples, see Stratified sampling in Spark and https://sparkbyexamples.com/spark/spark-sampling-with-examples/.

Ramachandran.A.G