
I am trying to do a relatively simple task in Spark, but it is quickly becoming quite painful. I am parsing JSON and want to update a field after the JSON has been parsed. I want to do this after the parsing, since the JSON is complicated (nested) with many elements. The parsed structure looks roughly like this:

"attributes" -> 
   "service1" ->
   "service2" ->
   ...
"keyId"

However, updating the parsed result seems just as complicated. The generated Row does not seem to know about any columns beyond the top-level ones ("attributes"/"keyId"). So, for example, I cannot use withColumn on a nested field, because the top-level row does not see it.

jsonDf.map((parsedJson: Row) => {
  // "attributes" is field 0; within it, field 2 is a list whose first
  // element holds the string field I want to replace
  val targetFieldToReplace = parsedJson.getAs[Row](0).getList[Row](2).get(0).getAs[String](0)
  ???? // how do I put a new value back into the nested structure?
})

I am able to extract the value, but I don't know how to put it back. I've thought about converting everything into a sequence, but that doesn't seem like a good idea because it would flatten the nested structure. I could re-create the Row element by element, but at that point it seems wrong. What am I missing here?
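
Concretely, the element-by-element reconstruction would look roughly like this (a minimal sketch: the positional indices mirror the snippet above, "newValue" is a placeholder, and the RowEncoder plumbing is an assumption about how the map would have to be typed):

import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val updated = jsonDf.map { parsedJson: Row =>
  val attributes = parsedJson.getAs[Row](0)
  val services   = attributes.getList[Row](2)
  val target     = services.get(0)

  // Replace field 0 of the innermost Row, then copy every enclosing
  // level back together unchanged
  val newTarget   = Row.fromSeq("newValue" +: target.toSeq.tail) // placeholder value
  val newServices = newTarget +: services.asScala.toSeq.tail
  val newAttrs    = Row.fromSeq(attributes.toSeq.updated(2, newServices))
  Row.fromSeq(parsedJson.toSeq.updated(0, newAttrs))
}(RowEncoder(jsonDf.schema))

It preserves the schema, but it is exactly the field-by-field copying that feels wrong.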

stan
  • Could you add an example `json` and the action you want to perform? Nested fields should be accessible in the following way: `outer.inner` – Gelerion Sep 29 '19 at 04:57
  • What if, instead of loading it as JSON, you load it as text? Then you do the JSON processing at the row level, as shown here: https://stackoverflow.com/questions/58037893/read-external-json-file-into-rdd-and-extract-specific-values-in-scala/58043151#58043151. Finally, you can generate/return a tuple or Row from the map function. – abiratsis Sep 29 '19 at 10:48
  • You can try flattening the JSON. Please see [this answer](https://stackoverflow.com/questions/34271398/flatten-nested-spark-dataframe) – roizaig Nov 27 '19 at 09:34
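
For reference, a sketch of the dot-path idea from the first comment: nested fields are addressable as `attributes.service1` in a select, and the struct can be rebuilt with `struct` instead of Row surgery (the field names here are hypothetical, and every field of the struct has to be listed out):

import org.apache.spark.sql.functions.{col, lit, struct}

// Rebuild the whole "attributes" struct: swap one nested field and
// copy the others through by name (names are hypothetical)
val patched = jsonDf.withColumn(
  "attributes",
  struct(
    lit("newValue").as("service1"),            // replaced
    col("attributes.service2").as("service2")  // copied through unchanged
  )
)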

0 Answers