6

I have a json object that has an unfortunate combination of nesting and arrays. So its not totally obvious how to query it with spark sql.

here is a sample object:

{
  stuff: [
    {a:1,b:2,c:3}
  ]
}

so, in javascript, to get the value for c, I'd write myData.stuff[0].c

And in my spark sql query, if that array wasn't there, I'd be able to use dot notation:

SELECT stuff.c FROM blah

but I can't, because the innermost object is wrapped in an array.

I've tried:

SELECT stuff.0.c FROM blah // FAIL
SELECT stuff.[0].c FROM blah // FAIL

So, what is the magical way to select that data? or is that even supported yet?

Kristian
  • 21,204
  • 19
  • 101
  • 176

1 Answers1

10

It is not clear what you mean by JSON object so lets consider two different cases:

  1. An array of structs

    import tempfile    
    
    path = tempfile.mktemp()
    with open(path, "w") as fw: 
        fw.write('''{"stuff": [{"a": 1, "b": 2, "c": 3}]}''')
    df = sqlContext.read.json(path)
    df.registerTempTable("df")
    
    df.printSchema()
    ## root
    ##  |-- stuff: array (nullable = true)
    ##  |    |-- element: struct (containsNull = true)
    ##  |    |    |-- a: long (nullable = true)
    ##  |    |    |-- b: long (nullable = true)
    ##  |    |    |-- c: long (nullable = true)
    
    sqlContext.sql("SELECT stuff[0].a FROM df").show()
    
    ## +---+
    ## |_c0|
    ## +---+
    ## |  1|
    ## +---+
    
  2. An array of maps

    # Note: schema inference from dictionaries has been deprecated
    # don't use this in practice
    df = sc.parallelize([{"stuff": [{"a": 1, "b": 2, "c": 3}]}]).toDF()
    df.registerTempTable("df")
    
    df.printSchema()
    ## root
    ##  |-- stuff: array (nullable = true)
    ##  |    |-- element: map (containsNull = true)
    ##  |    |    |-- key: string
    ##  |    |    |-- value: long (valueContainsNull = true)
    
    sqlContext.sql("SELECT stuff[0]['a'] FROM df").show()
    ## +---+
    ## |_c0|
    ## +---+
    ## |  1|
    ## +---+
    

See also Querying Spark SQL DataFrame with complex types

Community
  • 1
  • 1
zero323
  • 322,348
  • 103
  • 959
  • 935