
I have a schema as shown below. How can I parse the nested objects?

root
 |-- apps: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- appName: string (nullable = true)
 |    |    |-- appPackage: string (nullable = true)
 |    |    |-- Ratings: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- date: string (nullable = true)
 |    |    |    |    |-- rating: long (nullable = true)
 |-- id: string (nullable = true)
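For intuition, a record matching this schema might look like the following. This is a made-up sample: only the field names come from the schema above, all values are assumptions.

```python
import json

# Hypothetical record shaped like the printed schema: apps is an array of
# structs, each with appName, appPackage, and a nested Ratings array of
# {date, rating} structs, plus a top-level id string.
sample = json.loads("""
{
  "id": "user-001",
  "apps": [
    {
      "appName": "MyApp",
      "appPackage": "com.example.myapp",
      "Ratings": [
        {"date": "2016-01-01", "rating": 4},
        {"date": "2016-02-01", "rating": 5}
      ]
    }
  ]
}
""")

# Navigating the nesting directly:
print(sample["apps"][0]["Ratings"][1]["rating"])  # → 5
```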

5 Answers


Assuming you read in a JSON file and print the schema you are showing us, like this:

DataFrame df = sqlContext.read().json("/path/to/file");
df.registerTempTable("df");
df.printSchema();

Then you can select nested objects inside a struct type like so...

DataFrame apps = df.select("apps");
apps.registerTempTable("apps");
apps.printSchema();
apps.show();

DataFrame appName = apps.select("apps.appName");
appName.registerTempTable("appName");
appName.printSchema();
appName.show();
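For intuition only, projecting appName over the apps array behaves like this plain-Python sketch (hypothetical data, Spark is not involved):

```python
# Hypothetical row shaped like the schema above: apps is a list of dicts.
row = {
    "id": "user-001",
    "apps": [
        {"appName": "MyApp", "appPackage": "com.example.myapp", "Ratings": []},
        {"appName": "OtherApp", "appPackage": "com.example.other", "Ratings": []},
    ],
}

# Selecting appName across the apps array yields one list of names per row;
# the plain-Python equivalent is a comprehension over the array column.
app_names = [app["appName"] for app in row["apps"]]
print(app_names)  # → ['MyApp', 'OtherApp']
```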
Vasilis Vagias
    Just to add: the above code does not need `registerTempTable` to work. You need `registerTempTable` only when you want to execute a Spark SQL query. Also, `registerTempTable` has been deprecated since Spark 2.0 and replaced by `createOrReplaceTempView` – Arjit Dec 30 '16 at 06:36
    This is assuming that you know the schema. What if you are not sure about the schema of the nested object? How do you even create the schema of the nested object at all? I kinda asked this question in here too: https://stackoverflow.com/questions/43438774/query-json-data-column-using-spark-dataframes-but-not-sure-about-its-schema – M.Rez Apr 16 '17 at 17:07
    I am having the same problem, and this code does not work for me. When I try to `select("app.element.appName")` (or the analogous fields for my case), I get the error `org.apache.spark.sql.AnalysisException: No such struct field element in...`. The `element` field is not present in the original JSON but is created to represent a JSON array, yet for some reason it isn't finding it – Paul Nov 04 '17 at 19:49

Try this:

val nameAndAddress = sqlContext.sql("""
    SELECT name, address.city, address.state
    FROM people
""")
nameAndAddress.collect.foreach(println)

Source: https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

ben jarman

Have you tried doing it straight from the SQL query like

Select apps.element.Ratings from yourTableName

This will probably return an array, and you can more easily access the elements inside. Also, I use this online JSON viewer when I have to deal with large JSON structures and the schema is too complex: http://jsonviewer.stack.hu/
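To see why an array comes back: each element of apps carries its own Ratings array, so the result per row is nested. A plain-Python sketch with hypothetical data:

```python
# Hypothetical row: two apps, each with its own Ratings array.
row = {
    "apps": [
        {"appName": "A", "Ratings": [{"date": "2016-01-01", "rating": 4}]},
        {"appName": "B", "Ratings": [{"date": "2016-01-02", "rating": 5},
                                     {"date": "2016-01-03", "rating": 3}]},
    ]
}

# Collecting every element's Ratings produces a list of lists per row.
ratings_per_app = [app["Ratings"] for app in row["apps"]]
print([len(r) for r in ratings_per_app])  # → [1, 2]
```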

Aleksandr M

I am using pyspark, but the logic should be similar. I found this way of parsing my nested JSON useful:

df.select(df.apps.appName.alias("apps_Name"), \
          df.apps.appPackage.alias("apps_Package"), \
          df.apps.Ratings.date.alias("apps_Ratings_date")) \
   .show()

The code could obviously be shortened with an f-string.

val df = spark.read.format("json").load("/path/to/file")
df.createOrReplaceTempView("df")
spark.sql("SELECT app.Ratings FROM df LATERAL VIEW explode(apps) t AS app WHERE app.appName LIKE '%app_name%'").show()
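With hypothetical rows, the intent of that WHERE clause (keep rows where some app's name contains the substring) can be sketched in plain Python as:

```python
# Made-up rows for illustration; only the field names follow the schema.
rows = [
    {"id": "1", "apps": [{"appName": "cool_app_name_pro"}]},
    {"id": "2", "apps": [{"appName": "unrelated"}]},
]

# Keep a row if any of its apps has a name containing the substring.
matches = [row for row in rows
           if any("app_name" in app["appName"] for app in row["apps"])]
print([row["id"] for row in matches])  # → ['1']
```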
Ganesh
  • 677
  • 8
  • 11