
I stored some JSON documents in MongoDB. Each document looks like: {"businessData":{"capacity":{"fuelCapacity":282}, ..}.

After reading all the documents, I want to export them as a valid JSON file. Specifically:

// Read JSON data from the DB
val df: DataFrame = MongoSpark.load(sparkSession, readConfig)
df.show
// Export into the file system
df.coalesce(1).write.mode(SaveMode.Overwrite).json("export.json")
// show displays the JSON values (truncated)
+--------------------+
|        businessData|
+--------------------+
|[[282],0,[true,20...|
|[[280],0,[true,20...|
|[[290],0,[true,20...|
|[[292],0,[true,20...|
|[[282],16,[true,2...|
+--------------------+

// export.json
{"businessData":{"capacity":{"fuelCapacity":282}, ..}
{"businessData":{"capacity":{"fuelCapacity":280}, ..}
{"businessData":{"capacity":{"fuelCapacity":290}, ..}
{"businessData":{"capacity":{"fuelCapacity":292}, ..}
{"businessData":{"capacity":{"fuelCapacity":282}, ..}

But when I export to the file system I want to combine these 5 rows into an Array and also add some custom meta-data. For example:

{
  "metadata" : { "exportTime": "20/20/2020" , ...} 
  "allBusinessData" : [
    {"businessData":{"capacity":{"fuelCapacity":282}, ..},
    // all 5 rows from above
  ]
}

I have seen questions here and here advising against it. They also only partially answer the question, as they don't add a custom JSON structure to the export.

Assuming, however, that this is the only way I can proceed, how can I do it?

Many thanks!

user1485864

1 Answer


From Spark 2.2+:

You can try using to_json, or create a struct<array<...>> field in Spark, then write the df in JSON format to get the required output.

  • for the sample data I assumed exporttime to be current_timestamp()

Example:

val df = spark.read.json(Seq("""[{"businessData":{"capacity":{"fuelCapacity":282}}},{"businessData":{"capacity":{"fuelCapacity":456}}}]""").toDS)

//creating a struct field called metadata and writing the data in json format
df.selectExpr("struct(current_timestamp() as exporttime, struct(collect_list(businessData) as businessData) as allBusinessData) as metadata").write.format("json").mode("overwrite").save("json_path")

//using to_json to create a json object in the dataframe
df.selectExpr("to_json(struct(current_timestamp() as exporttime, struct(collect_list(businessData) as businessData) as allBusinessData)) as metadata").show(false)

//+-------------------------------------------------------------------------------------------------------------------------------------------------------+
//|metadata                                                                                                                                               |
//+-------------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"exporttime":"2020-03-21T15:17:54.769-05:00","allBusinessData":{"businessData":[{"capacity":{"fuelCapacity":282}},{"capacity":{"fuelCapacity":456}}]}}|
//+-------------------------------------------------------------------------------------------------------------------------------------------------------+

//using .toJSON to view the json in the shell (non-prod use only)
df.selectExpr("struct(current_timestamp() as exporttime, struct(collect_list(businessData) as businessData) as allBusinessData) as metadata").toJSON.collect()

//Array[String] = Array({"metadata":{"exporttime":"2020-03-21T15:19:35.890-05:00","allBusinessData":{"businessData":[{"capacity":{"fuelCapacity":282}},{"capacity":{"fuelCapacity":456}}]}}})
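Following up on the comment below about the missing opening bracket: dropping the inner struct around collect_list should make allBusinessData itself a JSON array, matching the shape asked for in the question. This is a sketch of the same approach, not tested against a MongoSpark DataFrame; export_path is a placeholder:

```scala
// Sketch (assumed, untested): collect_list aliased directly as allBusinessData,
// so the output contains "allBusinessData":[{...},{...}] as a JSON array.
// df is the DataFrame loaded earlier (e.g. via MongoSpark.load).
df.selectExpr(
    "struct(current_timestamp() as exporttime, collect_list(businessData) as allBusinessData) as metadata"
  )
  .coalesce(1)                 // single output file, as in the question
  .write.mode("overwrite")
  .json("export_path")         // placeholder path
```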
notNull
  • Hello, I have to read using the MongoSpark driver (``MongoSpark.load``), which does not have a ``.toDS`` method. I still tried your solution, and it almost answers my question. I get ``"allBusinessData":{`` instead of the desired ``"allBusinessData":[{...`` (note the opening square bracket **[**) – user1485864 Mar 23 '20 at 12:54
  • @user1485864, you don't have to worry about `.toDS`... for the sample data I used **`.toDS`** to read it as **`.json`**. In your case your df is already in json format. – notNull Mar 23 '20 at 13:36