
So my initial schema looks like this:

root  
|-- database: String  
|-- table: String  
|-- data: struct (nullable = true)  
|    |-- element1: Int  
|    |-- element2: Char

The show() result displays the data column as a single ugly value like [null,2,3] etc.

What I want to do is turn the data struct into its own DataFrame, so the nested JSON's data is spread out among columns. But something like:

val dfNew = df.select("data") only gets me the same gross single column when I use show(), instead of the multiple columns specified by the schema (element1, element2, etc.).

Is there a way to do this?

Yuan JI
Brady Auen
  • Possible duplicate of [Querying Spark SQL DataFrame with complex types](http://stackoverflow.com/questions/28332494/querying-spark-sql-dataframe-with-complex-types) – zero323 Jul 18 '16 at 21:17
  • 1
    Check out [pandas.io.json.json_normalize](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.json_normalize.html). – Alicia Garcia-Raboso Jul 18 '16 at 21:41

1 Answer


Like this?

import org.apache.spark.sql.functions.col

// Sample data: a struct column named "data" with two fields
case class Data(element1: Int, element2: String)

val df = sqlContext.createDataFrame(sc.parallelize(Array(
    (1, Data(12312, "test"))))).toDF("i", "data")

// Pull individual struct fields out into top-level columns
df.select(col("data.element1"), col("data.element2"))

or this?

df.select(col("data.*"))
  • Along that, I'd like to be able to do it without specifying each column so I could just take all that are available. – Brady Auen Jul 18 '16 at 21:45
  • That second one looks like what I want. I tried this `val dfdata2 = df.select(df.col("data.*"))` and it didn't work, only one column. – Brady Auen Jul 18 '16 at 21:57
  • 1
    I couldn't get it to work for me, but this does `val dfdata2 = df.selectExpr("data.*")` and apparently this one too: `val dfdata = df.select("data.*")` – Brady Auen Jul 18 '16 at 22:12
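Putting the question and the comments together, here is a minimal self-contained sketch of the star-expansion approach that the asker confirmed works. It uses the Spark 1.x-era `sqlContext`/`sc` entry points as in the question; the sample values are illustrative:

```scala
import org.apache.spark.sql.functions.col

// A struct column "data" with two fields, matching the question's schema
case class Data(element1: Int, element2: String)

val df = sqlContext.createDataFrame(sc.parallelize(Array(
    (1, Data(12312, "test"))))).toDF("i", "data")

// Both forms expand every field of the struct into its own column,
// without naming each field explicitly:
val flat1 = df.select("data.*")
val flat2 = df.selectExpr("data.*")

// flat1.printSchema() now shows top-level columns element1 and element2
flat1.show()
```

Note that `data.*` only expands the one struct level; deeper nesting needs another round of `select("inner.*")` on the result.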