
I'm trying to convert a DataFrame to a valid JSON format, however I have not succeeded yet.

if I do like this:

fullDataset.repartition(1).write.json(f'{mount_point}/eds_ckan', mode='overwrite', ignoreNullFields=False)

I only get row-based JSON (one object per line, i.e. JSON Lines) like this:

{"col1":"2021-10-09T12:00:00.000Z","col2":336,"col3":0.0}
{"col1":"2021-10-16T20:00:00.000Z","col2":779,"col3":6965.396}
{"col1":"2021-10-17T12:00:00.000Z","col2":350,"col3":0.0}

Does anyone know how to convert it to valid JSON that is not row-based?
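To illustrate the problem: each line of Spark's output parses as JSON on its own, but the file as a whole is not a single valid JSON document. Collecting the parsed lines into a list and dumping that list yields one valid JSON array. A minimal sketch:

```python
import json

# Two lines of Spark's JSON-Lines output: each line is an independent
# JSON object, so the whole file is not one valid JSON document.
lines = [
    '{"col1":"2021-10-09T12:00:00.000Z","col2":336,"col3":0.0}',
    '{"col1":"2021-10-16T20:00:00.000Z","col2":779,"col3":6965.396}',
]
records = [json.loads(line) for line in lines]

# Wrapping the parsed objects in a list gives a single valid JSON array.
valid = json.dumps(records)
```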

RahulKumarShaw
SqlKindaGuy
  • Can you give an example of what your expected output would look like? – d-xa Feb 24 '22 at 08:54
    Could you please refer this : https://stackoverflow.com/questions/53426420/pyspark-how-to-convert-a-spark-dataframe-to-json-and-save-it-as-json-file & https://stackoverflow.com/questions/58238563/write-spark-dataframe-as-array-of-json-pyspark – AjayKumarGhose Feb 24 '22 at 10:23

1 Answer


Below is a sample example of converting a DataFrame to valid JSON.

Try using collect() and then json.dump:

import json

# collect() pulls all rows to the driver as a list of Row objects;
# asDict() converts each Row into a JSON-serializable dict.
collected_df = df_final.collect()
data = [row.asDict() for row in collected_df]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
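One pitfall worth noting: pyspark Row objects are not directly serializable by json.dump, which is why each row must be converted with asDict() first. A Spark-free sketch of the pattern, using a hypothetical Row stand-in in place of pyspark.sql.Row:

```python
import json

class Row:
    """Hypothetical stand-in for pyspark.sql.Row, for illustration only."""
    def __init__(self, **kw):
        self._d = kw
    def asDict(self):
        return dict(self._d)

# Stand-in for df_final.collect(): a list of Row objects on the driver.
collected = [Row(col1="2021-10-09T12:00:00.000Z", col2=336, col3=0.0)]

try:
    json.dumps(collected)  # Row objects raise TypeError: not serializable
except TypeError:
    pass

# Converting each row to a dict first produces one valid JSON array.
payload = json.dumps([r.asDict() for r in collected])
```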

Here are a few links to related discussions you can go through for more information.

Dataframe to valid JSON

Valid JSON in spark

SaiSakethGuduru