Reasons I felt this was not a duplicate of this question:
- `from_json` requires knowledge of the JSON schema ex-ante, which I do not have.
- `get_json_object` - I attempted to use this, but its result is itself a string, leaving me back at square one (see the sketch after this list). Additionally, it appears (from the `exprs` statement) that the author again expects ex-ante knowledge of the schema rather than inferring it.
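Purely for illustration, a minimal sketch of the `get_json_object` problem. The toy JSON literals here are invented; only the column names `id` and `fields` match my actual setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy dataframe mirroring my situation: 'id' is the primary key and
# 'fields' holds a JSON string whose schema is unknown ahead of time.
df = spark.createDataFrame(
    [(1, '{"a": 1, "b": {"c": "x"}}')],
    ["id", "fields"],
)

# get_json_object hands back the extracted value as a *string*, so any
# nested JSON (like $.b here) still needs parsing afterwards.
extracted = df.select("id", F.get_json_object("fields", "$.b").alias("b"))
extracted.printSchema()  # b is a string column, not a struct
```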
Requirements:
- Ex-ante, I do not know what the JSON schema is, so it must be inferred. `spark.read.json` seems the best candidate for schema inference, but all the examples I came across loaded the JSON from files; in my use case, the JSON is contained within a column of a dataframe (see the sketch after this list).
- I am agnostic to the source file type (tested here with parquet and CSV), but the source dataframe schema is, and will remain, well structured. For my use case, the JSON is contained within a source dataframe column called 'fields'.
- The resulting dataframe should link back to the primary key of the source dataframe ('id' in my example).
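For completeness, a minimal sketch of the direction I was exploring, using the same invented toy data as above (only the `id`/`fields` column names match my real case): feed the column's values to `spark.read.json` as an RDD of strings so the schema can be inferred, then parse the column with `from_json` while keeping the primary key:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy source dataframe; in reality this would come from parquet/CSV.
df = spark.createDataFrame(
    [(1, '{"a": 1, "b": {"c": "x"}}'), (2, '{"a": 2, "b": {"c": "y"}}')],
    ["id", "fields"],
)

# Infer the schema: spark.read.json accepts an RDD of JSON strings,
# so hand it the raw values of the 'fields' column.
inferred_schema = spark.read.json(
    df.select("fields").rdd.map(lambda row: row[0])
).schema

# Parse the column with the inferred schema and keep the primary key.
result = (
    df.withColumn("parsed", F.from_json("fields", inferred_schema))
      .select("id", "parsed.*")
)

result.printSchema()
result.show()
```

This is only a sketch under those assumptions; whether it is the idiomatic way to do this is exactly what I am asking.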