29

I have a dataframe that I want to write out as a single JSON file with a specific name. I tried the following:

df2 = df1.select(df1.col1,df1.col2)
df2.write.format('json').save('/path/file_name.json') # didn't work: creates a folder 'file_name.json' containing part-XXX files
df2.toJSON().saveAsTextFile('/path/file_name.json')   # didn't work: same result, a folder 'file_name.json' with part-XXX files

I'd appreciate it if someone could provide a solution.

Lijju Mathew

4 Answers

30

You can save this as a single file by coalescing to one partition before writing:

df2 = df1.select(df1.col1,df1.col2)
df2.coalesce(1).write.format('json').save('/path/file_name.json')

This will still create a folder named file_name.json, but inside it you'll find a single part-00000 file containing the whole data.
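
If you need the output under the exact name file_name.json, the usual follow-up (see the comments below) is to rename the single part file after writing. Here is a minimal sketch using the Hadoop FileSystem API through Spark's JVM gateway; the temp directory /path/tmp_output and the SparkSession name spark are assumptions for illustration:

df2.coalesce(1).write.format('json').save('/path/tmp_output')

# Locate the single part-* file and rename it (assumes an active SparkSession `spark`)
hadoop = spark._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(spark._jsc.hadoopConfiguration())
src_dir = hadoop.Path('/path/tmp_output')
part_file = [f.getPath() for f in fs.listStatus(src_dir)
             if f.getPath().getName().startswith('part-')][0]
fs.rename(part_file, hadoop.Path('/path/file_name.json'))
fs.delete(src_dir, True)  # remove the now-empty temp directory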

Rakesh Kumar
  • I wanted to write with a specific name file_name.json. Is there a direct way of writing it, other than renaming it? – Lijju Mathew Apr 07 '17 at 12:28
  • Instead of writing it as file_name.json, use a plain name like file_name – Rakesh Kumar Apr 07 '17 at 12:32
  • Because you are using Spark, your data is spread across multiple nodes, computed in parallel, and written in parts to your directory. One of the reasons to use Spark is that the data cannot be stored locally, so this is how the data is output. The larger your file, the more "part" files should come through. – lwileczek Oct 31 '17 at 14:34
  • @LijjuMathew: This should be what you are looking for: https://stackoverflow.com/a/60442604/530399 – Bikash Gyawali Apr 19 '21 at 12:43
6

You can do it by converting to a pandas DataFrame first (note that toPandas() collects all rows onto the driver, so this only works for data that fits in driver memory):

df.toPandas().to_json('path/file_name.json', orient='records', force_ascii=False, lines=True)
fedosique
2

PySpark stores its output in multiple part files, and as far as I know, it cannot write the JSON directly under a single given file name. This small Python function (using Databricks dbutils) should help achieve what you're trying to do.

def saveResult(data_frame, temp_location, file_path):
    # Write everything into a single part file in a temporary folder
    data_frame.coalesce(1).write.mode('append').json(temp_location)
    # Pick out the part file, skipping _SUCCESS and other marker files
    file = [f.path for f in dbutils.fs.ls(temp_location) if f.name.startswith('part-')][-1]
    dbutils.fs.cp(file, file_path)              # copy it to the desired path/name
    dbutils.fs.rm(temp_location, recurse=True)  # clean up the temp folder

Basically, you pass in the data frame, the temp_location where the file chunks are staged, and the full file path (path + filename) you'd like for the output file. The function writes the chunk, copies it to the desired location under the desired file name, and then deletes the temporary chunks.
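
For example, a hypothetical call (dbutils is only available on Databricks, and the temp path here is illustrative):

saveResult(df2, '/tmp/json_staging', '/path/file_name.json')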

-2

Here's another approach:

import os

df2 = df1.select(df1.col1, df1.col2)
df2.write.format('json').save('/path/folder_name')

# Concatenate the part files into a single JSON file, then drop the folder.
# Note: this only works when /path is on the driver's local filesystem.
os.system("cat /path/folder_name/*.json > /path/df.json")
os.system("rm -rf /path/folder_name")

This assumes the export is done in the analysis phase and that producing a single JSON file this way doesn't get carried into production.
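
A shell-free variant of the same idea, as a minimal sketch; it likewise assumes /path is on the driver's local filesystem:

import glob
import shutil

# Concatenate the part files into one JSON-lines file, then remove the folder
with open('/path/df.json', 'wb') as out:
    for part in sorted(glob.glob('/path/folder_name/part-*.json')):
        with open(part, 'rb') as src:
            shutil.copyfileobj(src, out)
shutil.rmtree('/path/folder_name')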