
I have the following dataframe in spark:

val test = sqlContext.read.json(path = "/path/to/jsonfiles/*")  
test.printSchema
root
 |-- properties: struct (nullable = true)
 |    |-- prop_1: string (nullable = true)
 |    |-- prop_2: string (nullable = true)
 |    |-- prop_3: boolean (nullable = true)
 |    |-- prop_4: long (nullable = true)
...

What I would like to do is flatten this dataframe so that prop_1 ... prop_n exist at the top level, i.e.:

test.printSchema
root
 |-- prop_1: string (nullable = true)
 |-- prop_2: string (nullable = true)
 |-- prop_3: boolean (nullable = true)
 |-- prop_4: long (nullable = true)
...

There are several solutions to similar problems. The best I can find is posed here. However, that solution only works if properties is of type Array; in my case, properties is of type StructType.

An alternate approach would be something like:

test.registerTempTable("test")
val test2 = sqlContext.sql("""SELECT properties.prop_1, ... FROM test""")

But in this case I have to specify each column explicitly, and that is inelegant.
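I suppose I could generate that column list from the schema instead of typing it out; a rough, untested sketch of what I mean (using the test table registered above):

import org.apache.spark.sql.types.StructType

// Untested sketch: read the field names off the properties struct
// and splice them into the query instead of listing them by hand.
val propNames = test.schema("properties").dataType
  .asInstanceOf[StructType]
  .fieldNames

val cols = propNames.map("properties." + _).mkString(", ")
val test2 = sqlContext.sql(s"SELECT $cols FROM test")

But that still feels like a workaround rather than the right tool.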

What is the best way to solve this problem?

– lospejos

Does this answer your question? [How to flatten a struct in a Spark dataframe?](https://stackoverflow.com/questions/38753898/how-to-flatten-a-struct-in-a-spark-dataframe) – Narahari B M Dec 11 '20 at 20:33

1 Answer


If you're not looking for a recursive solution, then in Spark 1.6+ the dot syntax with a star should work just fine:

val df = sqlContext.read.json(sc.parallelize(Seq(
  """{"properties": {
       "prop1": "foo", "prop2": "bar", "prop3": true, "prop4": 1}}"""
)))

df.select($"properties.*").printSchema
// root
//  |-- prop1: string (nullable = true)
//  |-- prop2: string (nullable = true)
//  |-- prop3: boolean (nullable = true)
//  |-- prop4: long (nullable = true)

Unfortunately this doesn't work in 1.5 and earlier.

In a case like this you can simply extract the required information directly from the schema. You'll find one example in Dropping a nested column from Spark DataFrame, which should be easy to adjust to fit this scenario, and another (recursive schema flattening in Python) in Pyspark: Map a SchemaRDD into a SchemaRDD.
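For reference, a rough Scala sketch of that schema-driven idea (an illustration of the approach, not the code behind those links): recursively collect the leaf columns, aliasing each with its underscored path so names stay unique:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Walk the schema and turn every leaf field into a top-level column,
// descending into nested structs as we go.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] =
  schema.fields.toSeq.flatMap {
    case StructField(name, inner: StructType, _, _) =>
      flattenSchema(inner, s"$prefix$name.")
    case StructField(name, _, _, _) =>
      Seq(col(prefix + name).alias((prefix + name).replace(".", "_")))
  }

df.select(flattenSchema(df.schema): _*).printSchema
// root
//  |-- properties_prop1: string (nullable = true)
//  |-- properties_prop2: string (nullable = true)
//  |-- properties_prop3: boolean (nullable = true)
//  |-- properties_prop4: long (nullable = true)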

– zero323