I have a PySpark dataframe that contains records for 6 million people, each with an individual userid. Each userid has 2000 entries. I want to save each userid's data into a separate csv file with the userid as the filename.
I have some code that does this, taken from the solution to this question. However, as I understand it, the code will try to create a partition for each of the 6 million ids. I don't actually need the partitioning itself, since I'm going to copy each of these files to another, non-HDFS server.
I should note that the code works for a small number of userids (up to 3000) but fails on the full 6 million.
Code:
output_file = '/path/to/some/hdfs/location'
myDF.write.partitionBy('userid').mode('overwrite').format("csv").save(output_file)
When I run the above, it takes WEEKS to complete, with most of that time spent on the write step. I assume this is because of the number of partitions. Even if I manually specify a small number of partitions, it still takes ages to execute.
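For reference, the "manually specify the number of partitions" attempt looked roughly like the following (the repartition count of 200 is just an illustrative value, not the exact number I used):

# repartition to a small, fixed number of partitions before writing;
# partitionBy still produces one output directory per userid
myDF.repartition(200, 'userid') \
    .write.partitionBy('userid') \
    .mode('overwrite') \
    .format('csv') \
    .save(output_file)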
Question: Is there a way to save each userid's data into a single, well-named file (filename = userid) without partitioning?