Use Spark to parse a JSON array column and carry ID column along as well

Question

I have a table with two columns: `json`, a string containing a JSON array, and `id`, an Int. If I select just the `json` column, convert the resulting DataFrame to a `Dataset<String>`, and parse it using `DataFrameReader.json()`, each JSON array is exploded into multiple rows in the resulting DataFrame. After that, I can see no way to correlate `id` back to the correct exploded rows. In Java this looks something like:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

Dataset<Row> df = reader.table("myTable"); // DataFrame containing json and id
df = df.selectExpr("json");                // DataFrame containing just json
Dataset<String> ds = df.map(               // Dataset<String> containing just json
    (MapFunction<Row, String>) row -> row.getString(0), Encoders.STRING());
df = reader.schema(jsonDdl).json(ds);      // exploded DataFrame, one row per array element

...now what? I have no way to correlate id back to the rows coming from the JSON parse operation. How can I do this differently such that the relationship between id and the parsed JSON columns is preserved?

select `df=df.selectExpr("json","id")` then in `jsonDdl` include id column too! — notNull, Apr 14 '20 at 15:39
@Shu thanks, but won't that just cause Spark to look for the `id` column inside the JSON string? — nclark, Apr 14 '20 at 16:06
@Shu the other problem with that is `DataFrameReader.json()` will accept a `Dataset<String>` in Java, or an `RDD<String>` (deprecated) but not a `DataFrame` (a.k.a. `Dataset<Row>`), so there's no way to sneak an extra column in there alongside the JSON string. — nclark, Apr 14 '20 at 16:26
Does this answer your question? [How to query JSON data column using Spark DataFrames?](https://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes) — user10938362, Apr 14 '20 at 18:12
Can you supply a data representation. Trying to understand why `DataFrameReader.json()` would cause an explosion _and_ if we can do a pre-processing step to introduce `id` in json itself. — D3V, Apr 14 '20 at 22:24
@D3V any JSON that is an array at the outer level, not an object, will result in row explosion to expand that outer array. — nclark, Apr 15 '20 at 20:57
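
One possible direction, in the spirit of the linked question, is to parse the string in column context instead of going through `DataFrameReader.json()`, so that `id` never leaves the row. The sketch below is not taken from the thread: it assumes Spark 2.2 or later (where `from_json` accepts an array-of-structs schema), and the element fields `name` and `qty`, the class name, and the sample rows are all hypothetical, since the question's `jsonDdl` is not shown.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.from_json;

public class JsonArrayWithId {

    // Hypothetical schema of one array element; replace with the real one.
    static final StructType ELEMENT = new StructType()
        .add("name", DataTypes.StringType)
        .add("qty", DataTypes.IntegerType);

    // Parses the "json" column as an array of ELEMENT structs and emits
    // one row per array element, with "id" carried along on every row.
    static Dataset<Row> parseKeepingId(Dataset<Row> df) {
        ArrayType arraySchema = DataTypes.createArrayType(ELEMENT);
        return df
            .withColumn("parsed", from_json(col("json"), arraySchema))
            .withColumn("elem", explode(col("parsed")))
            .select(col("id"), col("elem.*"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]")
            .appName("json-array-with-id")
            .getOrCreate();

        // Stand-in for reader.table("myTable"): an Int id and a JSON array string.
        StructType tableSchema = new StructType()
            .add("id", DataTypes.IntegerType)
            .add("json", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
            RowFactory.create(1, "[{\"name\":\"a\",\"qty\":1},{\"name\":\"b\",\"qty\":2}]"),
            RowFactory.create(2, "[{\"name\":\"c\",\"qty\":3}]"));
        Dataset<Row> df = spark.createDataFrame(rows, tableSchema);

        parseKeepingId(df).show();
        spark.stop();
    }
}
```

Note that `explode` drops rows whose array is null or empty; if those `id` values need to survive, `explode_outer` may be the better fit.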

0 Answers