I have a dataframe in Spark 2, shown below, in which each user has anywhere from 50 to thousands of posts. I would like to create a new dataframe that contains every user from the original dataframe but only 5 randomly sampled posts per user.
+--------+--------------+--------------------+
| user_id|       post_id|                text|
+--------+--------------+--------------------+
|67778705|44783131591473|some text...........|
|67778705|44783134580755|some text...........|
|67778705|44783136367108|some text...........|
|67778705|44783136970669|some text...........|
|67778705|44783138143396|some text...........|
|67778705|44783155162624|some text...........|
|67778705|44783688650554|some text...........|
|68950272|88655645825660|some text...........|
|68950272|88651393135293|some text...........|
|68950272|88652615409812|some text...........|
|68950272|88655744880460|some text...........|
|68950272|88658059871568|some text...........|
|68950272|88656994832475|some text...........|
+--------+--------------+--------------------+
Something like posts.groupby('user_id').agg(sample('post_id')),
but there is no such function in PySpark.
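To make the desired result concrete, here is a minimal plain-Python sketch (not Spark) of the per-user sampling described above. The function name sample_posts_per_user and the tuple layout (user_id, post_id, text) are hypothetical, chosen only for illustration. Keeping all posts for a user who has fewer than n is also an assumption.

```python
import random
from collections import defaultdict


def sample_posts_per_user(rows, n=5, seed=None):
    """Return at most n randomly chosen rows for every user_id in rows.

    rows is an iterable of (user_id, post_id, text) tuples (hypothetical layout).
    """
    rng = random.Random(seed)

    # Group the rows by user_id, which is the first field of each tuple.
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    # Draw the same fixed number of posts from every user, no matter how
    # many posts that user has (assumption: users with fewer than n keep all).
    sampled = []
    for user_rows in by_user.values():
        k = min(n, len(user_rows))
        sampled.extend(rng.sample(user_rows, k))
    return sampled


# The rows from the table above, with the text column shortened.
rows = [
    (67778705, 44783131591473, "some text"),
    (67778705, 44783134580755, "some text"),
    (67778705, 44783136367108, "some text"),
    (67778705, 44783136970669, "some text"),
    (67778705, 44783138143396, "some text"),
    (67778705, 44783155162624, "some text"),
    (67778705, 44783688650554, "some text"),
    (68950272, 88655645825660, "some text"),
    (68950272, 88651393135293, "some text"),
    (68950272, 88652615409812, "some text"),
    (68950272, 88655744880460, "some text"),
    (68950272, 88658059871568, "some text"),
    (68950272, 88656994832475, "some text"),
]

result = sample_posts_per_user(rows, n=5, seed=0)
```

What I am looking for is the equivalent of this in the Dataframe API, so that it scales to the full dataset.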
Any advice?
Update:
This question differs from the closely related question stratified-sampling-in-spark in two ways:
- It asks about disproportionate stratified sampling (a fixed number of posts per user, regardless of how many posts the user has) rather than the more common proportionate method covered in that question.
- It asks about doing this in Spark's Dataframe API rather than the RDD API.
I have also updated the question's title to clarify this.