
Given a Spark 2.3 DataFrame with a column containing JSON arrays, how can I convert those to Spark arrays of JSON strings? Or, equivalently, how can I explode the JSON, so that with an input of:

myJsonColumn
"""[{"a":1},{"b":2}]"""
"""[{"c":3},{"d":4}]"""
"""[{"e":3},{"f":4},{"g":{"h":1}}]"""

I get:

myExplodedJson
"""{"a":1}"""
"""{"b":2}"""
"""{"c":3}"""
"""{"d":4}"""
"""{"e":3}"""
"""{"f":4}"""
"""{"g":{"h":1}}"""

Many thanks!

p.s. the shape of the entries in the arrays is variable.

Here is a sample input, in case it is useful:

List("""[{"a":1},{"b":2}]""", """[{"c":3},{"d":4}]""", """[{"e":3},{"f":4},{"g":{"h":1}}]""").toDF.show

p.p.s. This is different from all the suggested duplicate questions so far. The question and solutions to e.g. How to query JSON data column using Spark DataFrames? apply when the data is all JSON, so the entire data frame can be read in as JSON. That does not apply here, where the data can be e.g. CSV with a single column that contains JSON strings. This is also different from the case where the query is a lookup of a particular key, which is indeed covered by get_json_object, and from the case where the number of elements in the array is bounded, where just looking up the first 100, say, array indices and then discarding nulls might be acceptable; even then, that only works when there are no nulls in the original JSON array.

p.p.p.s. What is needed is something like json_array_elements in Postgres. My backstop is to write a UDF that takes the JSON string and returns a Spark array, but UDFs tend to be slower than built-ins, and exploding JSON is likely to be core Spark functionality. Why reinvent the basics?
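
For reference, here is a minimal sketch of that UDF backstop. It assumes a spark-shell session (so spark.implicits._ is available), the Jackson ObjectMapper that ships with Spark, and that the sample input above has been assigned to a DataFrame df with its default column name value; jsonArrayElements is just a hypothetical name echoing the Postgres function:

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

// Parse the JSON array string and re-serialise each element as its own JSON
// string. The ObjectMapper is created inside the lambda so the UDF stays
// serialisable.
val jsonArrayElements = udf { (json: String) =>
  val mapper = new ObjectMapper()
  val arr = mapper.readTree(json)
  val elements: Seq[String] = (0 until arr.size()).map(i => mapper.writeValueAsString(arr.get(i)))
  elements
}

df.select(explode(jsonArrayElements($"value")).as("myExplodedJson")).show(false)

This handles a variable number of elements and arbitrary element shapes, at the cost of the usual UDF overhead.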

Max Murphy
  • What output do you want for {"g":{"h":1}}? – Michael West Sep 17 '18 at 16:56
  • @Michael The JSON serialisation, please. The g:{h} example was to emphasise that the schema is not constant. :-) The keys change with the others, but they can still be represented as a map. Each input is a JSON array of unspecified stuff. The corresponding output should be a Spark array of encoded "stuff". – Max Murphy Sep 18 '18 at 07:03
  • @Michael Thank you for the links. I had already read both of those and neither solves the problem. I have asked Tzach whether there are any updates, though, in case he knows of a better method enabled by more recent functions. – Max Murphy Sep 18 '18 at 07:12

1 Answer


You can use explode with the array function.

In Spark 2.3.1, you can do:

import org.apache.spark.sql.functions.{array, explode, get_json_object}
import spark.implicits._

val df = List("""[{"a":1},{"b":2}]""", """[{"c":3},{"d":4}]""", """[{"e":3},{"f":4},{"g":{"h":1}}]""").toDF

// Pull the first two elements out by JSON path, collect them into a Spark
// array and explode that array into one row per element.
df.withColumn("value", explode(array(
  get_json_object($"value", "$[0]"),
  get_json_object($"value", "$[1]")
))).show

The output of this is:

+-------+
|  value|
+-------+
|{"a":1}|
|{"b":2}|
|{"c":3}|
|{"d":4}|
|{"e":3}|
|{"f":4}|
+-------+
moriarty007
  • That is great when you always have the same number of elements in each array. What about when there is a variable number of elements? There is a brute-force approach - just get the first 100, hoping that there will never be more than 100, and then filter out nulls - but that has quite a few disadvantages. – Max Murphy Sep 18 '18 at 09:59
  • @MaxMurphy Will your data always have single-key JSONs? Something like [{"a":1},{"b":2},{"p":5}, .... ]? – moriarty007 Sep 18 '18 at 10:02
  • No. The structure of the array elements is variable. – Max Murphy Sep 18 '18 at 10:14
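
For completeness, here is a sketch of the brute-force workaround discussed in the comments above, assuming a chosen upper bound maxElems on the array length and the same df and value column as in the answer:

import org.apache.spark.sql.functions.{array, explode, get_json_object}
import spark.implicits._

// Probe the first maxElems array indices by JSON path; indices past the end of
// an array come back as null and are filtered out after the explode.
val maxElems = 100
val probes = (0 until maxElems).map(i => get_json_object($"value", s"$$[$i]"))

df.withColumn("value", explode(array(probes: _*)))
  .filter($"value".isNotNull)
  .show(false)

As noted in the question, this only works when the original JSON arrays never contain null elements and never exceed maxElems entries.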