I am currently using Spark (PySpark, Spark version 1.6) and I have a DataFrame like:
DataFrame[clientId: bigint, clientName: string, action: string, ...]
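For reference, a small DataFrame with this shape can be mocked up like so (using the sqlContext from the pyspark shell; the sample values are placeholders, not my real data):

# Hypothetical sample rows matching the schema above; the real table is far larger.
df = sqlContext.createDataFrame(
    [(1, "John Doe", "foo"),
     (1, "John Doe", "bar"),
     (1, "John Doe", "baz"),
     (2, "Jane Roe", "qux")],
    ["clientId", "clientName", "action"])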
I want to dump it to S3, segregated by an attribute (e.g. clientId), in the following layout: s3://path/<clientId>/<datafiles>.
I want the datafiles to contain the rows for the corresponding clientId in JSON format, so for the path s3://path/1/ the datafiles will contain:
{"clientId":1, "clientName":"John Doe", "action":"foo", ...}
{"clientId":1, "clientName":"John Doe", "action":"bar", ...}
{"clientId":1, "clientName":"John Doe", "action":"baz", ...}
I was thinking of using groupBy and then toJSON, but with DataFrames you can only collect the data, and the DataFrame is too big to fit in the driver (the I/O would also be massive). How can I save each group's partial results directly from the executors?
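For what it's worth, this is roughly the collect-based version I had in mind (the local file writes here are just a stand-in for the S3 uploads); it funnels the whole dataset through the driver, which is exactly what I need to avoid:

import json

# Group the JSON rows per clientId and pull everything back to the driver.
grouped = (df.toJSON()                                        # RDD of JSON strings
             .map(lambda row: (json.loads(row)["clientId"], row))
             .groupByKey()
             .collect())                                      # <- too big for the driver

for client_id, rows in grouped:
    # Stand-in for uploading to s3://path/<clientId>/ from the driver;
    # even if it fit in memory, the single-machine I/O would be massive.
    with open("client_{}.json".format(client_id), "w") as out:
        out.write("\n".join(rows) + "\n")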