
TL;DR: When I dump a Spark DataFrame as json, I always end up with something like

{"key1": "v11", "key2": "v21"}
{"key1": "v12", "key2": "v22"}
{"key1": "v13", "key2": "v23"}

which is invalid json. I can manually edit the dumped file to get something I can parse:

[
  {"key1": "v11", "key2": "v21"},
  {"key1": "v12", "key2": "v22"},
  {"key1": "v13", "key2": "v23"}
]

but I'm pretty sure I'm missing something that would let me avoid this manual edit. I just don't know what.

More details:

I have an org.apache.spark.sql.DataFrame and I try dumping it to json using the following code:

myDataFrame.write.json("file.json")

I also tried with:

myDataFrame.toJSON.saveAsTextFile("file.json")

In both cases it dumps each row correctly, but the rows are missing separating commas, as well as the enclosing square brackets. Consequently, when I subsequently try to parse this file, the parser I use insults me and then fails.

I would be grateful to learn how I can dump valid json. (Reading the documentation of the DataFrameWriter didn't provide me with any interesting hints.)

gturri

1 Answer


This is expected output. Spark uses a JSON Lines-like format for a number of reasons:

  • It can be parsed and loaded in parallel.
  • Parsing can be done without loading the full file into memory.
  • It can be written in parallel.
  • It can be written without storing a complete partition in memory.
  • It is valid input even if the file is empty.
  • Finally, a Row in Spark is a struct, which maps to a JSON object, not an array.
  • ...

You can create the desired output in a few ways, but it will always conflict with at least one of the points above.
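
If your goal is only to get the data back into Spark (or any other tool that understands line-delimited JSON), the output of write.json is already valid input and needs no reshaping. A minimal sketch, assuming a Spark 2.x SparkSession named spark (on 1.6 the equivalent is sqlContext.read.json):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// "file.json" is the directory produced by myDataFrame.write.json("file.json");
// every line inside its part files is a standalone JSON object.
val reloaded = spark.read.json("file.json")
reloaded.show()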

You can for example write a single JSON document for each partition:

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark (e.g. in spark-shell), needed for the $"..." syntax

df
  .groupBy(spark_partition_id())
  .agg(collect_list(struct(df.columns.map(col): _*)).alias("data"))
  .select($"data")
  .write
  .json(output_path)

You could prepend this with repartition(1) to get a single JSON document, but that is not something you want to do unless the data is very small.
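
For reference, a sketch of that single-document variant, reusing the imports from the snippet above (only sensible for small data, since everything has to fit in one partition):

df
  .repartition(1)                 // collapse everything into one partition first
  .groupBy(spark_partition_id())  // now a single group covering all rows
  .agg(collect_list(struct(df.columns.map(col): _*)).alias("data"))
  .select($"data")
  .write
  .json(output_path)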

In 1.6, an alternative would be glom:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Wrap the original schema: each output row has a single "data" field
// holding an array of the original rows.
val newSchema = StructType(Seq(StructField("data", ArrayType(df.schema))))

sqlContext.createDataFrame(
  df.rdd.glom().flatMap(a => if (a.isEmpty) Seq() else Seq(Row(a))),
  newSchema
)
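
The DataFrame returned by createDataFrame can then be written out with the same write.json call as before; a small usage sketch, with wrapped as a hypothetical name for that result and output_path as a placeholder:

// Hypothetical binding for the DataFrame returned by createDataFrame above.
val wrapped = sqlContext.createDataFrame(
  df.rdd.glom().flatMap(a => if (a.isEmpty) Seq() else Seq(Row(a))),
  newSchema
)

// Each line of the output is a single {"data": [...]} document holding one
// non-empty partition of the original DataFrame.
wrapped.write.json(output_path)
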
Alper t. Turker
  • Yeah. This won't work with such an outdated version. a) https://stackoverflow.com/q/35528966/8371915 b) Because it doesn't support aggregations on structs. You'll have to use RDDs for that. – Alper t. Turker Jan 30 '18 at 10:56