Given a Spark 2.3 DataFrame with a column containing JSON arrays, how can I convert those to Spark arrays of JSON strings? Or, equivalently, how can I explode the JSON so that, with an input of:
myJsonColumn
"""[{"a":1},{"b":2}]"""
"""[{"c":3},{"d":4}]"""
"""[{"e":3},{"f":4},{"g":{"h":1}}]"""
I get:
myExplodedJson
"""{"a":1}"""
"""{"b":2}"""
"""{"c":3}"""
"""{"d":4}"""
"""{"e":3}"""
"""{"f":4}"""
"""{"g":{"h":1}}"""
Many thanks!
P.S. The shape of the entries in the arrays is variable.
Here is a sample input, in case it is useful:
List("""[{"a":1},{"b":2}]""", """[{"c":3},{"d":4}]""", """[{"e":3},{"f":4},{"g":{"h":1}}]""").toDF.show
P.P.S. This is different from all the duplicate questions suggested so far. The question and answers at, e.g., How to query JSON data column using Spark DataFrames? apply when the data is all JSON, so the entire data frame can be read in as JSON; that does not apply here, where the data can be, e.g., CSV with a single column that contains JSON strings. It is also different from the case where the query is a lookup of a particular key, which is indeed covered by get_json_object, and from the case where the number of elements in the array is bounded, so that looking up, say, the first 100 array indices and then discarding the nulls might be acceptable (sketched just below); even then, that only works when there are no nulls in the original JSON array.
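For reference, here is a minimal sketch of that bounded-index workaround, assuming the column is named myJsonColumn and that no array ever has more than three elements (both assumptions for illustration, not facts about my data). It mainly shows why the approach is unsatisfying: it needs an arbitrary upper bound and cannot tell "past the end of the array" apart from a genuine null element.

import org.apache.spark.sql.functions.{array, col, explode, get_json_object}
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val df = List(
  """[{"a":1},{"b":2}]""",
  """[{"c":3},{"d":4}]""",
  """[{"e":3},{"f":4},{"g":{"h":1}}]"""
).toDF("myJsonColumn")

val maxElems = 3  // assumed upper bound on array length
// $[i] paths pull out the first maxElems candidate elements of each array
val candidates = (0 until maxElems).map(i => get_json_object(col("myJsonColumn"), s"$$[$i]"))

df.select(explode(array(candidates: _*)).as("myExplodedJson"))
  .where(col("myExplodedJson").isNotNull)  // drop the padding from shorter arrays
  .show(false)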
P.P.P.S. What is needed is something like json_array_elements in Postgres. My backstop is to write a UDF that takes the JSON string and returns a Spark array (sketched below), but UDFs tend to be slower than built-ins, and exploding JSON is likely to be covered by core Spark functionality. Why reinvent the basics?
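For completeness, here is a minimal sketch of that UDF backstop, reusing df and myJsonColumn from the snippet above and parsing with Jackson (already on Spark's classpath); the UDF re-serializes each array element as a JSON string, and the resulting Spark array can then be exploded. The name explodeJsonArray is just illustrative.

import org.apache.spark.sql.functions.{col, explode, udf}
import com.fasterxml.jackson.databind.ObjectMapper

// Parse the JSON array string and return its elements re-serialized as JSON strings.
val explodeJsonArray = udf { s: String =>
  if (s == null) Seq.empty[String]
  else {
    val mapper = new ObjectMapper()  // created per call; fine for a sketch
    val arr = mapper.readTree(s)
    (0 until arr.size).map(i => mapper.writeValueAsString(arr.get(i))).toSeq
  }
}

df.select(explode(explodeJsonArray(col("myJsonColumn"))).as("myExplodedJson")).show(false)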