
TL;DR: When I dump a Spark DataFrame as json, I always end up with something like

{"key1": "v11", "key2": "v21"}
{"key1": "v12", "key2": "v22"}
{"key1": "v13", "key2": "v23"}

which is invalid json. I can manually edit the dumped file to get something I can parse:

[
  {"key1": "v11", "key2": "v21"},
  {"key1": "v12", "key2": "v22"},
  {"key1": "v13", "key2": "v23"}
]

but I'm pretty sure I'm missing something that would let me avoid this manual edit. I just don't know what.

More details:

I have an org.apache.spark.sql.DataFrame and I try dumping it to json using the following code:

myDataFrame.write.json("file.json")

I also tried with:

myDataFrame.toJSON.saveAsTextFile("file.json")

In both cases it dumps each row correctly, but the rows are missing separating commas, as well as the enclosing square brackets. Consequently, when I subsequently try to parse this file, the parser I use insults me and then fails.

I would be grateful to learn how I can dump valid json. (Reading the documentation of the DataFrameWriter didn't provide me with any interesting hints.)

gturri

1 Answer


This is expected output. Spark uses a JSON Lines-like format for a number of reasons:

  • It can be parsed and loaded in parallel.
  • Parsing can be done without loading the full file into memory.
  • It can be written in parallel.
  • It can be written without storing a complete partition in memory.
  • It is valid input even if the file is empty.
  • Finally, a Row in Spark is a struct, which maps to a JSON object, not an array.
  • ...

You can create the desired output in a few ways, but it will always conflict with at least one of the points above.
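
If your goal is only to get the data back into Spark (or any other tool that understands line-delimited JSON), the output of write.json is already valid input and needs no reshaping. A minimal sketch, assuming a Spark 2.x SparkSession named spark (on 1.6 the equivalent is sqlContext.read.json):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// "file.json" is the directory produced by myDataFrame.write.json("file.json");
// every line inside its part files is a standalone JSON object.
val reloaded = spark.read.json("file.json")
reloaded.show()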

You can for example write a single JSON document for each partition:

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark (e.g. in spark-shell), needed for the $"..." syntax

df
  .groupBy(spark_partition_id())
  .agg(collect_list(struct(df.columns.map(col): _*)).alias("data"))
  .select($"data")
  .write
  .json(output_path)

You could prepend this with repartition(1) to get a single JSON document, but that is not something you want to do unless the data is very small.
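
For reference, a sketch of that single-document variant, reusing the imports from the snippet above (only sensible for small data, since everything has to fit in one partition):

df
  .repartition(1)                 // collapse everything into one partition first
  .groupBy(spark_partition_id())  // now a single group covering all rows
  .agg(collect_list(struct(df.columns.map(col): _*)).alias("data"))
  .select($"data")
  .write
  .json(output_path)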

In 1.6, an alternative would be glom:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Wrap the original schema: each output row has a single "data" field
// holding an array of the original rows.
val newSchema = StructType(Seq(StructField("data", ArrayType(df.schema))))

sqlContext.createDataFrame(
  df.rdd.glom().flatMap(a => if (a.isEmpty) Seq() else Seq(Row(a))),
  newSchema
)
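
The DataFrame returned by createDataFrame can then be written out with the same write.json call as before; a small usage sketch, with wrapped as a hypothetical name for that result and output_path as a placeholder:

// Hypothetical binding for the DataFrame returned by createDataFrame above.
val wrapped = sqlContext.createDataFrame(
  df.rdd.glom().flatMap(a => if (a.isEmpty) Seq() else Seq(Row(a))),
  newSchema
)

// Each line of the output is a single {"data": [...]} document holding one
// non-empty partition of the original DataFrame.
wrapped.write.json(output_path)
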
Alper t. Turker
  • Yeah. This won't work with such an outdated version. a) https://stackoverflow.com/q/35528966/8371915 b) Because it doesn't support aggregations on structs. You'll have to use RDDs for that. – Alper t. Turker Jan 30 '18 at 10:56