My input dataframe looks like this:
+----------+------+------+
| timestamp|weight|    id|
+----------+------+------+
|01-01-2022|   123|abc123|
|02-02-2022|   456|def456|
|03-03-2022|   789|ghi789|
+----------+------+------+
The goal is to write the records of this dataframe into a .json file with the following format:
{"summaries":[{"id":"abc123","timestamp":"01-01-2022","weight":123},{"id":"def456","timestamp":"02-02-2022","weight":456},{"id":"ghi789","timestamp":"03-03-2022","weight":789}],"status":200}
Therefore I want my dataframe to come out like this, so that I can write it to the .json file:
+---------------------------------------------------------+-------+
|                                                summaries| status|
+---------------------------------------------------------+-------+
|[{"timestamp":"01-01-2022", "weight":123, "id":"abc123"},|       |
| {"timestamp":"02-02-2022", "weight":456, "id":"def456"},|       |
| {"timestamp":"03-03-2022", "weight":789, "id":"ghi789"}]|    200|
+---------------------------------------------------------+-------+
I've created a starting point for my dataframe:
data = [('01-01-2022', 123, 'abc123'), ('02-02-2022', 456, 'def456'), ('03-03-2022', 789, 'ghi789')]
columns = ["timestamp", "weight", "id"]
df = spark.createDataFrame(data, columns)
I have tried 2 strategies:
1.
from pyspark.sql.functions import struct

dfConvert = df.withColumn("summaries", struct("timestamp", "weight", "id"))
However, from there I am struggling to collect the records into a single record and to add the 'status' column.
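From what I have found so far, collect_list might be the tool to fold the structs into one array, but I am not sure whether the following sketch is the correct (or performant) way to finish this strategy:
from pyspark.sql.functions import collect_list, lit, struct

# Rough sketch: collect all structs into a single array column and attach
# the constant status column.
dfConvert = (
    df.agg(collect_list(struct("timestamp", "weight", "id")).alias("summaries"))
      .withColumn("status", lit(200))
)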
2.
# make rows from the dataframe (this collects everything to the driver)
rows = df.rdd.map(lambda row: row.asDict()).collect()
# print(rows)
dfConvert = spark.createDataFrame([(rows, "200")], ["summaries", "status"])
However, with this strategy everything is collected into driver memory, which I want to avoid: later on in the process I will have large data sets, and this code is less performant than the withColumn method.
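For clarity, this is roughly what ends up on the driver after the .collect(), i.e. the whole dataset as plain Python objects:
# The entire dataframe materialised in driver memory as a list of dicts
# (output reformatted over several lines for readability):
rows = df.rdd.map(lambda row: row.asDict()).collect()
print(rows)
# [{'timestamp': '01-01-2022', 'weight': 123, 'id': 'abc123'},
#  {'timestamp': '02-02-2022', 'weight': 456, 'id': 'def456'},
#  {'timestamp': '03-03-2022', 'weight': 789, 'id': 'ghi789'}]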
NOTE: The groupBy method won't work, because I will have duplicated records in each of the columns.
The writing itself then works fine:
dfConvert.write.format('json').mode("overwrite").save("MyDocuments/write_path")
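As a sanity check I read the result back in afterwards (Spark writes a directory of part files rather than a single .json file, so I point the reader at the same path):
# Read the written JSON back and inspect schema and contents.
check = spark.read.json("MyDocuments/write_path")
check.printSchema()
check.show(truncate=False)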