
I found a very strange behavior in pyspark when I use randomSplit. I have a column `is_clicked` that takes values 0 or 1, and there are way more zeros than ones. After a random split I would expect the data to be uniformly distributed. But instead, I see that the first rows in the splits are all `is_clicked=1`, followed by rows that are all `is_clicked=0`. You can see that the number of clicks in the original dataframe `df` is 9 out of 1000 (which is what I expect). But after the random split the number of clicks is 1000 out of 1000. If I take more rows, I will see that they are all `is_clicked=1` until there are no more such rows, and only then do the `is_clicked=0` rows follow.
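A minimal sketch of the kind of check I am doing (assuming a dataframe `df` that already contains the `is_clicked` column):

```python
import pyspark.sql.functions as sf

# Clicks among the first 1000 rows of the original dataframe: around 9.
print(df.limit(1000).filter(sf.col('is_clicked') == 1).count())

df_train, df_test = df.randomSplit([0.5, 0.5])

# Clicks among the first 1000 rows of the split: unexpectedly, all 1000.
print(df_train.limit(1000).filter(sf.col('is_clicked') == 1).count())
```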

Does anyone know why the distribution changes after the random split? How can I make `is_clicked` be uniformly distributed after the split?

[Screenshot: is_clicked counts in df vs. the first rows after randomSplit]

Sergey Ivanov
  • it is "random" split. It only takes care of splitting records to the specified weight regardless of column values. Please take a look at https://stackoverflow.com/questions/47637760/stratified-sampling-with-pyspark for stratified sampling. – Emma Aug 02 '22 at 19:52
  • Maybe I was not clear enough, but I do need random splitting, and it's not. For some reason, it gives me the first rows to be the ones with `is_clicked=1`, while it should not be the case. – Sergey Ivanov Aug 03 '22 at 09:49
  • Can you provide a sample dataset (1000 records)? – Sachin Tiwari Aug 03 '22 at 12:41
  • A Spark dataframe is distributed across multiple partitions and not ordered. Unless you use `sort` or `orderBy`, you are not guaranteed to have the same ordering after a function. Maybe that's what you are seeing? – Emma Aug 03 '22 at 13:44
  • So, if your original dataframe has `is_clicked=1` more than 1000, it is *possible* to be like your result. Do you know how many records you have in total and how many are `is_clicked=1`? – Emma Aug 03 '22 at 13:59
  • @Emma see the problem and the solution below. – Sergey Ivanov Aug 03 '22 at 15:51

1 Answer


So indeed pyspark does sort the data when it performs randomSplit. Here is a quote from the comments in the source code:

It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized which could result in overlapping splits. To prevent this, we explicitly sort each input partition to make the ordering deterministic. Note that MapTypes cannot be sorted and are explicitly pruned out from the sort order.

The solution is to either reshuffle the data after the split, or just use filter instead of randomSplit:

Solution 1:

```python
import pyspark.sql.functions as sf

# Add a random column, shuffle on it, then split.
df = df.withColumn('rand', sf.rand(seed=42)).orderBy('rand')
df_train, df_test = df.randomSplit([0.5, 0.5])
# Reassign: orderBy returns a new dataframe, it does not sort in place.
df_train = df_train.orderBy('rand')
```

Solution 2:

```python
# Reuses the 'rand' column added above; roughly half the rows land in each split.
df_train = df.filter(df.rand < 0.5)
df_test = df.filter(df.rand >= 0.5)
```

Here is a blog post with more details.
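As a quick sanity check (a sketch reusing the names from the snippets above), counting clicks among the first rows of the reshuffled split should now roughly match the base rate:

```python
# With the shuffle in place, the first 1000 rows of each split should
# again contain only a handful of clicks rather than all of them.
print(df_train.limit(1000).filter(sf.col('is_clicked') == 1).count())
print(df_test.limit(1000).filter(sf.col('is_clicked') == 1).count())
```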

Sergey Ivanov
  • oh very interesting! I didn't know `randomSplit` works in that way. Nice medium post as well! Thank you. – Emma Aug 03 '22 at 16:01