I am currently using Spark (PySpark, Spark version 1.6) and I have a DataFrame like:
DataFrame[clientId: bigint, clientName: string, action: string, ...]
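For reference, a small DataFrame with this shape can be mocked up like so (using the sqlContext from the pyspark shell; the sample values are placeholders, not my real data):

# Hypothetical sample rows matching the schema above; the real table is far larger.
df = sqlContext.createDataFrame(
    [(1, "John Doe", "foo"),
     (1, "John Doe", "bar"),
     (1, "John Doe", "baz"),
     (2, "Jane Roe", "qux")],
    ["clientId", "clientName", "action"])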
I want to dump it to S3, segregated by an attribute (e.g. clientId), in the following layout: s3://path/<clientId>/<datafiles>.
I want the datafiles to contain the rows for the corresponding clientId in JSON format, so for the path s3://path/1/ the datafiles will contain:
{"clientId":1, "clientName":"John Doe", "action":"foo", ...}
{"clientId":1, "clientName":"John Doe", "action":"bar", ...}
{"clientId":1, "clientName":"John Doe", "action":"baz", ...}
I was thinking of using groupBy and then toJSON, but with DataFrames you can only collect the data, and the DataFrame is too big to fit in the driver (the I/O would also be massive). How can I save each group's partial results directly from the executors?
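For what it's worth, this is roughly the collect-based version I had in mind (the local file writes here are just a stand-in for the S3 uploads); it funnels the whole dataset through the driver, which is exactly what I need to avoid:

import json

# Group the JSON rows per clientId and pull everything back to the driver.
grouped = (df.toJSON()                                        # RDD of JSON strings
             .map(lambda row: (json.loads(row)["clientId"], row))
             .groupByKey()
             .collect())                                      # <- too big for the driver

for client_id, rows in grouped:
    # Stand-in for uploading to s3://path/<clientId>/ from the driver;
    # even if it fit in memory, the single-machine I/O would be massive.
    with open("client_{}.json".format(client_id), "w") as out:
        out.write("\n".join(rows) + "\n")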