I would like to partition a DataFrame in a stratified way. The DataFrame has a column with a lot of zeros and only a few ones, and I would like to partition it so that each partition keeps the same ratio between zeros and ones, using a custom Partitioner. I don't know how to do it.

I have found similar situations in "Stratified sampling with pyspark" and "Stratified sampling in Spark", but they use sampling instead of partitioning. Any ideas? This is the first time I'm trying to partition data in a custom way. I'm using Spark + Scala + DataFrames.
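To make the goal concrete, here is a minimal sketch of one possible approach, not a confirmed solution. It assumes the label column is named by the caller, that the data fits a window over each class, and that the helper names (`ExactPartitioner`, `stratifiedRepartition`, the temporary `__pid` column) are hypothetical. Rows are numbered within each class, dealt round-robin to partition ids, and then placed with a custom `Partitioner` on the underlying RDD:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

// Hypothetical partitioner: the key already is the target partition index.
class ExactPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}

object StratifiedPartitioning {
  // Spreads each class of `labelCol` evenly over `n` partitions.
  def stratifiedRepartition(df: DataFrame, labelCol: String, n: Int, seed: Long = 42L): DataFrame = {
    // Number rows inside each class in random order (row_number starts at 1).
    val perClass = Window.partitionBy(col(labelCol)).orderBy(rand(seed))
    // Round-robin: the i-th row of a class goes to partition (i - 1) % n.
    val withPid = df.withColumn("__pid", ((row_number().over(perClass) - 1) % n).cast("int"))
    val pidIdx = withPid.schema.fieldIndex("__pid")

    // Custom Partitioners only apply to key-value RDDs, so go through the RDD API.
    val placed = withPid.rdd
      .map(row => (row.getInt(pidIdx), row))
      .partitionBy(new ExactPartitioner(n))
      .values

    // Rebuild the DataFrame; dropping a column does not trigger another shuffle.
    df.sparkSession.createDataFrame(placed, withPid.schema).drop("__pid")
  }
}
```

With this scheme each partition receives either the floor or the ceiling of (class count / n) rows of every class, so the ratio is only approximately preserved, and a class with fewer rows than `n` will be missing from some partitions. The window step moves each class through a single task, which may be slow for a very large majority class.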

mjbsgll
  • Check this answer: https://stackoverflow.com/a/50476540/4758823 – Som Jun 14 '20 at 15:22
  • That answer is similar to what I want, but it splits the data into two DataFrames. In my case, I want to partition the data so that each worker/node receives data with a similar proportion of class labels. – mjbsgll Jun 14 '20 at 15:33

0 Answers