root
 |
 |-- dogs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- color: string (nullable = true)
 |    |    |    |-- sources: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |-- _2: age (nullable = true)

Running data.select("dogs").show(2,False) displays the following:

+---------------------------------------------------------------------------------+
|names                                                                            |
+---------------------------------------------------------------------------------+
|[[[Max,White,WrappedArray(SanDiego)],3], [[Spot,Black,WrappedArray(SanDiego)],2]]|
|[[[Michael,Black,WrappedArray(SanJose)],1]]                                      |
+---------------------------------------------------------------------------------+
only showing top 2 rows

Is it possible to access the array elements in each cell? For example, I want to retrieve (Max, White), (Spot, Black) and (Michael, Black) from the dogs column.

In addition, I would like to expand each row that holds n elements into n rows, if possible.

Thanks!

Edamame
  • Possible duplicate of [Querying Spark SQL DataFrame with complex types](http://stackoverflow.com/questions/28332494/querying-spark-sql-dataframe-with-complex-types) – zero323 Apr 25 '16 at 19:06
  • It is the same question in scala-spark, though Edamame seems to be working with pyspark code. Not sure how SO should organize these (especially given their similarity), but the pyspark equivalent answer is below. – David Apr 25 '16 at 19:24
  • Can you post a sample data set? – dheee Apr 25 '16 at 20:09

1 Answer

You can use explode, as shown below, to get a DataFrame in which each row holds one element of the array, so a row with n dogs becomes n rows.

data.registerTempTable("data")
dataExplode = sqlContext.sql("select explode(dogs) as dog from data")
dataExplode.show()

Then, you can use select to obtain just the columns you are interested in.

David
  • @Edamame sorry, there was a typo in the code. I forgot the quotes around "data" in registerTempTable. I edited the code, hopefully it works for you now – David Apr 25 '16 at 20:09
  • `explode()` is useful when working with nested datatypes in dataframes. – Myles Baker Jul 29 '16 at 15:32