I found some very strange behavior in PySpark when I use randomSplit. I have a column is_clicked that takes values 0 or 1, and there are far more zeros than ones. After a random split I would expect the values to be uniformly distributed across the rows. Instead, the first rows in each split are all is_clicked=1, followed by rows that are all is_clicked=0. In the original dataframe df the number of clicks is 9 out of 1000 (which is what I expect), but after the random split the number of clicks is 1000 out of 1000. If I take more rows, they are all is_clicked=1 until there are no more rows like this, and then they are followed by rows with is_clicked=0.
Does anyone know why the distribution changes after the random split? How can I make is_clicked be uniformly distributed after the split?