29

I have a dataframe that I want to write out as a single JSON file with a specific name. I tried the following:

df2 = df1.select(df1.col1,df1.col2)
df2.write.format('json').save('/path/file_name.json') # didn't work: creates a folder 'file_name.json' containing part-XXX files
df2.toJSON().saveAsTextFile('/path/file_name.json')   # didn't work: same result, a folder 'file_name.json' with part-XXX files

I'd appreciate it if someone could provide a solution.

Lijju Mathew

4 Answers

30

You can save this as a single file by coalescing to one partition before writing:

df2 = df1.select(df1.col1,df1.col2)
df2.coalesce(1).write.format('json').save('/path/file_name.json')

This will still create a folder named file_name.json, but inside it you'll find a single part-00000 file containing the whole data.
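
If you need the output under the exact name file_name.json, the usual follow-up (see the comments below) is to rename the single part file after writing. Here is a minimal sketch using the Hadoop FileSystem API through Spark's JVM gateway; the temp directory /path/tmp_output and the SparkSession name spark are assumptions for illustration:

df2.coalesce(1).write.format('json').save('/path/tmp_output')

# Locate the single part-* file and rename it (assumes an active SparkSession `spark`)
hadoop = spark._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(spark._jsc.hadoopConfiguration())
src_dir = hadoop.Path('/path/tmp_output')
part_file = [f.getPath() for f in fs.listStatus(src_dir)
             if f.getPath().getName().startswith('part-')][0]
fs.rename(part_file, hadoop.Path('/path/file_name.json'))
fs.delete(src_dir, True)  # remove the now-empty temp directory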

Rakesh Kumar
  • I wanted to write with a specific name file_name.json. Is there a direct way of writing it, other than renaming it? – Lijju Mathew Apr 07 '17 at 12:28
  • Instead of writing it as file_name.json, use a plain name like file_name – Rakesh Kumar Apr 07 '17 at 12:32
  • Because you are using Spark, your data is spread across multiple nodes, computed in parallel, and written in parts to your directory. One of the reasons to use Spark is that the data cannot be stored locally, so this is how the data is output. The larger your file, the more "part" files should come through. – lwileczek Oct 31 '17 at 14:34
  • @LijjuMathew: This should be what you are looking for: https://stackoverflow.com/a/60442604/530399 – Bikash Gyawali Apr 19 '21 at 12:43
6

You can do it by converting to a pandas DataFrame first (note that toPandas() collects all rows onto the driver, so this only works for data that fits in driver memory):

df.toPandas().to_json('path/file_name.json', orient='records', force_ascii=False, lines=True)
fedosique
2

PySpark stores its output in multiple part files, and as far as I know, it cannot write the JSON directly under a single given file name. This small Python function (using Databricks dbutils) should help achieve what you're trying to do.

def saveResult(data_frame, temp_location, file_path):
    # Write everything into a single part file in a temporary folder
    data_frame.coalesce(1).write.mode('append').json(temp_location)
    # Pick out the part file, skipping _SUCCESS and other marker files
    file = [f.path for f in dbutils.fs.ls(temp_location) if f.name.startswith('part-')][-1]
    dbutils.fs.cp(file, file_path)              # copy it to the desired path/name
    dbutils.fs.rm(temp_location, recurse=True)  # clean up the temp folder

Basically, you pass in the data frame, the temp_location where the file chunks are staged, and the full file path (path + filename) you'd like for the output file. The function writes the chunk, copies it to the desired location under the desired file name, and then deletes the temporary chunks.
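
For example, a hypothetical call (dbutils is only available on Databricks, and the temp path here is illustrative):

saveResult(df2, '/tmp/json_staging', '/path/file_name.json')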

-2

Here's another approach:

import os

df2 = df1.select(df1.col1, df1.col2)
df2.write.format('json').save('/path/folder_name')

# Concatenate the part files into a single JSON file, then drop the folder.
# Note: this only works when /path is on the driver's local filesystem.
os.system("cat /path/folder_name/*.json > /path/df.json")
os.system("rm -rf /path/folder_name")

This assumes the export is done in the analysis phase and that producing a single JSON file this way doesn't get carried into production.
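
A shell-free variant of the same idea, as a minimal sketch; it likewise assumes /path is on the driver's local filesystem:

import glob
import shutil

# Concatenate the part files into one JSON-lines file, then remove the folder
with open('/path/df.json', 'wb') as out:
    for part in sorted(glob.glob('/path/folder_name/part-*.json')):
        with open(part, 'rb') as src:
            shutil.copyfileobj(src, out)
shutil.rmtree('/path/folder_name')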