
I'm trying to convert a DataFrame to a valid JSON format, however I have not succeeded yet.

if I do like this:

fullDataset.repartition(1).write.json(f'{mount_point}/eds_ckan', mode='overwrite', ignoreNullFields=False)

I only get row-based JSON (one object per line, i.e. JSON Lines) like this:

{"col1":"2021-10-09T12:00:00.000Z","col2":336,"col3":0.0}
{"col1":"2021-10-16T20:00:00.000Z","col2":779,"col3":6965.396}
{"col1":"2021-10-17T12:00:00.000Z","col2":350,"col3":0.0}

Does anyone know how to convert it to valid JSON that is not row-based?
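To illustrate the problem: each line of Spark's output parses as JSON on its own, but the file as a whole is not a single valid JSON document. Collecting the parsed lines into a list and dumping that list yields one valid JSON array. A minimal sketch:

```python
import json

# Two lines of Spark's JSON-Lines output: each line is an independent
# JSON object, so the whole file is not one valid JSON document.
lines = [
    '{"col1":"2021-10-09T12:00:00.000Z","col2":336,"col3":0.0}',
    '{"col1":"2021-10-16T20:00:00.000Z","col2":779,"col3":6965.396}',
]
records = [json.loads(line) for line in lines]

# Wrapping the parsed objects in a list gives a single valid JSON array.
valid = json.dumps(records)
```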

RahulKumarShaw
SqlKindaGuy
  • Can you give an example of what your expected output would look like? – d-xa Feb 24 '22 at 08:54
    Could you please refer this : https://stackoverflow.com/questions/53426420/pyspark-how-to-convert-a-spark-dataframe-to-json-and-save-it-as-json-file & https://stackoverflow.com/questions/58238563/write-spark-dataframe-as-array-of-json-pyspark – AjayKumarGhose Feb 24 '22 at 10:23

1 Answer


Below is a sample example of converting a DataFrame to valid JSON.

Try using collect() and then json.dump:

import json

# collect() pulls all rows to the driver as a list of Row objects;
# asDict() converts each Row into a JSON-serializable dict.
collected_df = df_final.collect()
data = [row.asDict() for row in collected_df]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
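One pitfall worth noting: pyspark Row objects are not directly serializable by json.dump, which is why each row must be converted with asDict() first. A Spark-free sketch of the pattern, using a hypothetical Row stand-in in place of pyspark.sql.Row:

```python
import json

class Row:
    """Hypothetical stand-in for pyspark.sql.Row, for illustration only."""
    def __init__(self, **kw):
        self._d = kw
    def asDict(self):
        return dict(self._d)

# Stand-in for df_final.collect(): a list of Row objects on the driver.
collected = [Row(col1="2021-10-09T12:00:00.000Z", col2=336, col3=0.0)]

try:
    json.dumps(collected)  # Row objects raise TypeError: not serializable
except TypeError:
    pass

# Converting each row to a dict first produces one valid JSON array.
payload = json.dumps([r.asDict() for r in collected])
```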

Here are a few links to related discussions you can go through for more information.

Dataframe to valid JSON

Valid JSON in spark

SaiSakethGuduru