
I am trying to do a relatively simple task in Spark, but it is quickly becoming quite painful. I am parsing JSON and want to update a field after the JSON has been parsed. I want to do this after the parsing, since the JSON is complicated (nested) with many elements. The parsed structure looks roughly like this:

"attributes" -> 
   "service1" ->
   "service2" ->
   ...
"keyId"

However, updating the parsed result seems just as complicated. The generated Row does not seem to know about any columns beyond the top-level ones ("attributes"/"keyId"). So, for example, I cannot use withColumn on a nested field, because the top-level row does not see it.

jsonDf.map((parsedJson: Row) => {
  // "attributes" is field 0; within it, field 2 is a list whose first
  // element holds the string field I want to replace
  val targetFieldToReplace = parsedJson.getAs[Row](0).getList[Row](2).get(0).getAs[String](0)
  ???? // how do I put a new value back into the nested structure?
})

I am able to extract the value, but I don't know how to put it back. I've thought about converting everything into a sequence, but that doesn't seem like a good idea because it would flatten the nested structure. I could re-create the Row element by element, but at that point it seems wrong. What am I missing here?
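
Concretely, the element-by-element reconstruction would look roughly like this (a minimal sketch: the positional indices mirror the snippet above, "newValue" is a placeholder, and the RowEncoder plumbing is an assumption about how the map would have to be typed):

import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val updated = jsonDf.map { parsedJson: Row =>
  val attributes = parsedJson.getAs[Row](0)
  val services   = attributes.getList[Row](2)
  val target     = services.get(0)

  // Replace field 0 of the innermost Row, then copy every enclosing
  // level back together unchanged
  val newTarget   = Row.fromSeq("newValue" +: target.toSeq.tail) // placeholder value
  val newServices = newTarget +: services.asScala.toSeq.tail
  val newAttrs    = Row.fromSeq(attributes.toSeq.updated(2, newServices))
  Row.fromSeq(parsedJson.toSeq.updated(0, newAttrs))
}(RowEncoder(jsonDf.schema))

It preserves the schema, but it is exactly the field-by-field copying that feels wrong.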

stan
  • Could you add an example `json` and the action you want to perform? Nested fields should be accessible in the following way: `outer.inner` – Gelerion Sep 29 '19 at 04:57
  • What if, instead of loading it as JSON, you load it as text? Then you do the JSON processing at the row level, as shown here: https://stackoverflow.com/questions/58037893/read-external-json-file-into-rdd-and-extract-specific-values-in-scala/58043151#58043151. Finally, you can generate/return a tuple or Row from the map function. – abiratsis Sep 29 '19 at 10:48
  • You can try flattening the JSON. Please see [this answer](https://stackoverflow.com/questions/34271398/flatten-nested-spark-dataframe) – roizaig Nov 27 '19 at 09:34
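
For reference, a sketch of the dot-path idea from the first comment: nested fields are addressable as `attributes.service1` in a select, and the struct can be rebuilt with `struct` instead of Row surgery (the field names here are hypothetical, and every field of the struct has to be listed out):

import org.apache.spark.sql.functions.{col, lit, struct}

// Rebuild the whole "attributes" struct: swap one nested field and
// copy the others through by name (names are hypothetical)
val patched = jsonDf.withColumn(
  "attributes",
  struct(
    lit("newValue").as("service1"),            // replaced
    col("attributes.service2").as("service2")  // copied through unchanged
  )
)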

0 Answers