pyspark df.write.json('s3e://somepath') is binary

Question

I'm using PySpark and want to write my results as JSON. However, when I call

df.write.json("s3e://somepath")

the resulting file is named like this: part-00000-sdfh837fjh-6f8a-44d1-b0bb-sdjfh9236dj-c000.json

The commands that create my df are similar to the following:

import json 
from pyspark.sql.functions import *
from pyspark.sql.types import *

rdd = sc.parallelize([(1,2,3),(4,5,6),(7,8,9)])
df = rdd.toDF(["a","b","c"])

resultrdd = df.rdd.map(lambda x: ({"x": {"y": x.a}, "xx" + "yy": {"yy" + "yy": x.b}}))
resultdf = resultrdd.toDF()

resultdf.write.json("s3e://mybucket/testingjson")  # the resulting files are binary, not JSON. Why, and how can I fix it?

resultrdd.collect()
resultdf.printSchema()

When I open the files in the resulting s3e://mybucket/testingjson, they are binary and cannot be opened with a text editor. Why is that, and how can I make df.write.json create actual JSON files?

Note that the printed schema is as follows:

root
 |-- x: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- xxyy: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)

If I print the DataFrame contents (to verify what the JSON should contain), I get:

resultdf
[{'x': {'y': 1}, 'xxyy': {'yyyy': 2}},
 {'x': {'y': 4}, 'xxyy': {'yyyy': 5}},
 {'x': {'y': 7}, 'xxyy': {'yyyy': 8}}]

score 2 · Accepted Answer · answered Dec 04 '18 at 15:54

Check which S3 filesystem scheme you are using. Is "s3e" a typo? Try:

resultdf.write.json("s3a://mybucket/testingjson")
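
A small guard can catch a mistyped scheme before the write is attempted. This is only a sketch: `check_s3_scheme` and `KNOWN_S3_SCHEMES` are hypothetical names, and the allowed set simply lists the three schemes mentioned in this answer. Your cluster may register others, so adjust it as needed.

```python
from urllib.parse import urlparse

# Schemes named in this answer (assumption: your cluster may support others).
KNOWN_S3_SCHEMES = {"s3", "s3n", "s3a"}

def check_s3_scheme(path):
    """Return the URI scheme of `path`; raise ValueError if it is not in KNOWN_S3_SCHEMES."""
    scheme = urlparse(path).scheme
    if scheme not in KNOWN_S3_SCHEMES:
        raise ValueError(
            f"unexpected scheme {scheme!r} in {path!r}; "
            f"expected one of {sorted(KNOWN_S3_SCHEMES)}"
        )
    return scheme
```

Calling `check_s3_scheme("s3e://mybucket/testingjson")` raises a `ValueError`, while the `s3a://` form of the same path passes.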

Also, if it's a small dataset, you can coalesce it into a single file:

resultdf.coalesce(1).write.json("s3a://mybucket/testingjson")

For more details, see "Technically what is the difference between s3n, s3a and s3?"
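
If the files still look binary after changing the scheme, it may help to download one part file and inspect it locally. Spark's JSON writer emits one JSON object per line, so a healthy part file should parse line by line. The helper below is a rough diagnostic sketch, not part of Spark: `describe_part_file` is a hypothetical name, and the gzip check covers only one possible cause (compression) that the question does not confirm.

```python
import json

def describe_part_file(path):
    """Classify a local part file as 'gzip', 'json-lines', 'empty', or 'unknown'."""
    with open(path, "rb") as f:
        data = f.read()
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    try:
        lines = [ln for ln in data.decode("utf-8").splitlines() if ln.strip()]
        for ln in lines:
            json.loads(ln)  # each non-blank line must be a valid JSON document
    except (UnicodeDecodeError, ValueError):
        return "unknown"
    return "json-lines" if lines else "empty"
```

A result of "unknown" means the bytes are neither UTF-8 JSON lines nor gzip, which would point at something outside Spark's JSON writer, such as the filesystem layer behind the URI scheme.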

