0

I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below 'jsonSchema' is a StructType with the correct layout for the json-string which is in the 'body' column of the dataframe.

val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))

This returns a dataframe where the root object is

jsontostructs(CAST(body AS STRING)):struct

followed by the fields in the schema (looks correct). When I try another select on the newDF

val transform = newDF.select($"propertyNameInTheParsedJsonObject")

it throws the exception

org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given 
input columns: [jsontostructs(CAST(body AS STRING))];;

I'm aparently missing something. I hoped from_json would return a dataframe I could manipulate further.

My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.

Molotch
  • 365
  • 7
  • 20

1 Answers1

2

from_json returns a struct or (array<struct<...>>) column. It means it is a nested object. If you've provided a meaningful name:

val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")

and the schema describes a plain struct you could use standard methods like

newDF.select($"parsed.propertyNameInTheParsedJsonObject")

otherwise please follow the instructions for accessing arrays.