I am currently using Spark (PySpark, Spark version 1.6) and I have a DataFrame like:

DataFrame[clientId: bigint, clientName: string, action: string, ...]

I want to dump it to S3, segregated by an attribute (e.g. clientId), in the following layout: s3://path/<clientId>/<datafiles>.

I want the datafiles to contain the rows for the corresponding clientId in JSON format, so for the path s3://path/1/ the datafiles would contain:

{"clientId":1, "clientName":"John Doe", "action":"foo", ...}
{"clientId":1, "clientName":"John Doe", "action":"bar", ...}
{"clientId":1, "clientName":"John Doe", "action":"baz", ...}

I was thinking of using groupBy followed by toJSON, but with DataFrames you can only collect the grouped data to the driver, and the DataFrame is too big to fit there (the I/O would also be massive). How can I save the partial results of each group directly from the executors?

carlescere
1 Answer


Just partitionBy and write to JSON:

df.write.partitionBy("clientId").json(output_path)

You'll get a structure like:

s3://path/clientId=some_id/<datafiles>
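For completeness, here is a minimal end-to-end sketch. It assumes Spark 1.6 with a SQLContext, that partitionBy is supported for the JSON writer in your build, and uses placeholder S3 paths; build df however you actually do in your job.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dump-per-client")
sqlContext = SQLContext(sc)

# Placeholder input: construct the DataFrame however you do in practice
df = sqlContext.read.json("s3://path/input/")

# Each distinct clientId value becomes a subdirectory clientId=<value>,
# and every part file under it holds one JSON object per row
df.write.mode("overwrite").partitionBy("clientId").json("s3://path/output/")

One caveat: with partitionBy, the partition column is encoded in the directory name rather than written into the data files, so the clientId field will not appear inside each JSON line; Spark reconstructs it from the path when you read the output back.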