I have a table with two columns: json, a string containing a JSON array, and id, an Int. If I select just the json column, convert the resulting DataFrame to a Dataset<String>, and then parse the JSON using DataFrameReader.json(), the JSON array is exploded into multiple rows in the resulting DataFrame. At that point I can see no way to correlate id back to the correct exploded rows. In Java this looks something like:
Dataset<Row> df = reader.table("myTable"); // DataFrame containing json and id
df = df.selectExpr("json"); // DataFrame containing just json
Dataset<String> ds = spark.createDataset( // Dataset<String> containing just json; spark is assumed to be the SparkSession
    df.toJavaRDD().map((Function<Row, String>) row -> (String) row.get(0)).rdd(), STRING());
df = reader.schema(jsonDdl).json(ds); // exploded DataFrame, one row per array element
...now what? I have no way to correlate id back to the rows coming from the JSON parse operation. How can I do this differently so that the relationship between id and the parsed JSON columns is preserved?
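
One direction that might work, offered as a sketch rather than a confirmed answer: parse the json column in place with from_json and then explode the resulting array, so that id never leaves the row. This assumes Spark 2.3 or later and that jsonDdl describes the fields of a single array element (as it would when passed to reader.schema(...)). The class and method names (ExplodeJsonKeepingId, explodeJson) and the intermediate column names (parsed, element) are hypothetical.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public final class ExplodeJsonKeepingId {

    // Hypothetical helper: expects a DataFrame with columns `id` and `json`,
    // where `json` holds a JSON array of objects matching `jsonDdl`.
    public static Dataset<Row> explodeJson(Dataset<Row> df, String jsonDdl) {
        // Schema of one array element, built from the DDL string.
        StructType elementSchema = StructType.fromDDL(jsonDdl);
        // The column as a whole is an array of such elements.
        ArrayType arraySchema = DataTypes.createArrayType(elementSchema);

        return df
            // Parse the string column into array<struct<...>> without leaving the row.
            .withColumn("parsed", from_json(col("json"), arraySchema))
            // One output row per array element; `id` is carried along automatically.
            .withColumn("element", explode(col("parsed")))
            // Flatten the struct fields next to `id`.
            .select(col("id"), col("element.*"));
    }
}
```

With the table from the question, this would be called as explodeJson(reader.table("myTable"), jsonDdl). Note that explode drops rows whose array is null or empty; explode_outer could be substituted if those ids need to be kept.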