
Reasons I felt this was not a duplicate of [How to query JSON data column using Spark DataFrames?](https://stackoverflow.com/questions/34069282/how-to-query-json-data-column-using-spark-dataframes):

  • from_json requires the JSON schema ex-ante, which I do not have.
  • get_json_object - I attempted to use this, but the result of running get_json_object is itself a string, leaving me back at square one (demonstrated in the sketch below). Additionally, it appears (from the exprs statement in that answer) that the author again assumes ex-ante knowledge of the schema rather than inferring it.
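
For context, a minimal, self-contained sketch of the get_json_object behavior described above (the sample row and JSON path are invented for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
demo = spark.createDataFrame([(1, '{"a": {"b": 2}}')], ["id", "fields"])

# get_json_object extracts a value, but the result is itself a string,
# so any nested structure stays unparsed
demo.select(F.get_json_object("fields", "$.a").alias("a")).printSchema()
# root
#  |-- a: string (nullable = true)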

Requirements:

  • Ex-ante, I do not know what the JSON schema is, and thus need to infer it. spark.read.json seems the best fit for inferring a schema, but all the examples I came across loaded the JSON from files. In my use case, the JSON is contained within a column of a dataframe.

  • I am agnostic to the source file type (tested here with parquet and csv). However, the source dataframe schema is and will be well structured. For my use case, the JSON is contained in a source-dataframe column called 'fields' (see the sketch after this list).

  • The resulting dataframe should link back to the primary key of the source dataframe ('id' in my example).
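
To make the setup concrete, here is a minimal source dataframe matching these requirements; the column names 'id' and 'fields' are from the question, while the sample values are invented:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 'id' is the primary key; 'fields' holds a JSON document as a plain string
df = spark.createDataFrame(
    [
        (1, '{"a": 1, "b": {"c": "x"}}'),
        (2, '{"a": 2, "b": {"c": "y"}}'),
    ],
    ["id", "fields"],
)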


1 Answer


The key turned out to be in the Spark source code: path, when passed to spark.read.json, may be an "RDD of Strings storing json objects".

The source dataframe was well structured, with the primary key in an 'id' column and the JSON payload stored as a string in the 'fields' column.

The code I came up with was:

import json

def inject_id(row):
    # Parse the JSON string held in 'fields' and inject the row's primary key
    js = json.loads(row['fields'])
    js['id'] = row['id']
    return json.dumps(js)

# spark.read.json accepts an RDD of JSON strings and infers the schema
json_df = spark.read.json(df.rdd.map(inject_id))
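
To inspect the result and recover the link back to the source dataframe (printSchema and join are standard PySpark; the join shown here is an assumed usage pattern, not from the original post):

# Inspect the schema Spark inferred from the JSON strings
json_df.printSchema()

# The injected 'id' column links each parsed row back to the source dataframe
enriched = json_df.join(df.drop("fields"), on="id", how="inner")
enriched.show()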

json_df then had the schema inferred from the JSON payload, plus the injected 'id' column linking each row back to the source dataframe.

Note that I did not test this with more deeply nested structures, but I believe it will support anything that spark.read.json supports.
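
As a quick sanity check of the nested case (the sample JSON is invented; inject_id is the function defined above):

nested_src = spark.createDataFrame(
    [(1, '{"outer": {"inner": [1, 2, 3]}}')],
    ["id", "fields"],
)
nested_df = spark.read.json(nested_src.rdd.map(inject_id))

# Nested objects are inferred as structs, arrays as array columns
nested_df.select("id", "outer.inner").show()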
