I have a table of 400,000 distinct users. I would like to split it into 4 parts, and I expect each user to end up in exactly one part.
Here is my code:
val numPart = 4
val size = 1.0 / numPart
val nsizes = Array.fill(numPart)(size)
val data = userList.randomSplit(nsizes)
Then I write each data(i), for i from 0 to 3, to Parquet files. When I read the output directory back, group by user id, and count the number of parts each user appears in, some users turn out to be located in two or more parts.
I still have no idea why this happens.
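For reference, here is a minimal sketch of the overlap check described above. It is an illustration under assumptions, not the exact code from my job: the column name `user_id`, the helper `usersInMultipleParts`, and the small stand-in table are all hypothetical, and the check runs on the in-memory splits rather than on the Parquet files read back from disk.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct, lit}

object SplitOverlapCheck {
  // Hypothetical helper: tag each split with its index, union the splits,
  // and keep only the ids that show up in more than one part.
  def usersInMultipleParts(parts: Array[DataFrame], idCol: String): DataFrame =
    parts.zipWithIndex
      .map { case (df, i) => df.select(col(idCol)).withColumn("part", lit(i)) }
      .reduce(_ union _)
      .groupBy(col(idCol))
      .agg(countDistinct(col("part")).as("num_parts"))
      .filter(col("num_parts") > 1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-overlap-check")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the real table of distinct users (assumed column name).
    val userList = (1 to 1000).toDF("user_id")

    // Same split as in the question.
    val numPart = 4
    val size = 1.0 / numPart
    val nsizes = Array.fill(numPart)(size)
    val data = userList.randomSplit(nsizes)

    val overlapping = usersInMultipleParts(data, "user_id")
    println(s"users in more than one part: ${overlapping.count()}")

    spark.stop()
  }
}
```

With my real table, replacing the stand-in `userList` with the actual DataFrame should reproduce the same kind of count I get from the Parquet output.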