
I'm having trouble transforming each root key of a JSON document into its own record in a DataFrame, for an undetermined number of records.

I have a DataFrame generated from a JSON similar to the following:

import spark.implicits._  // provides the String encoder needed by createDataset

val exampleJson = spark.createDataset(
  """
  {"ITEM1512":
        {"name":"Yin",
         "address":{"city":"Columbus",
                    "state":"Ohio"}
                    }, 
    "ITEM1518":
        {"name":"Yang",
         "address":{"city":"Working",
                    "state":"Marc"}
                    }
  }""" :: Nil)

When I read it with the following instruction:

val itemsExample = spark.read.json(exampleJson)

the schema and DataFrame generated are the following:

+-----------------------+-----------------------+
|ITEM1512               |ITEM1518               |
+-----------------------+-----------------------+
|[[Columbus, Ohio], Yin]|[[Working, Marc], Yang]|
+-----------------------+-----------------------+

root
 |-- ITEM1512: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- ITEM1518: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |-- name: string (nullable = true)

But I want to generate something like this:

+-----------------------+-----------------------+
|Item                   |Values                 |
+-----------------------+-----------------------+
|ITEM1512               |[[Columbus, Ohio], Yin]|
|ITEM1518               |[[Working, Marc], Yang]|
+-----------------------+-----------------------+

So, in order to parse this JSON data, I need to read every root column and add it as a record in the DataFrame, because there are more than the two items I wrote as an example. In fact, there are millions of items that I'd like to add to a DataFrame.

I'm trying to replicate the solution found here: How to parse the JSON data using Spark-Scala, with this code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

val columns: Array[String]       = itemsExample.columns
var arrayOfDFs: Array[DataFrame] = Array()

for (col_name <- columns) {

  val temp = itemsExample.selectExpr("explode(" + col_name + ") as element")
    .select(
      lit(col_name).as("Item"),
      col("element.E").as("Value"))

  arrayOfDFs = arrayOfDFs :+ temp
}

val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)

But I run into a problem: while in the example from the other question the root is an array, in my case the root is a StructType. Therefore the following exception is thrown:

org.apache.spark.sql.AnalysisException: cannot resolve 'explode(ITEM1512)' due to data type mismatch: input to function explode should be array or map type, not struct<address:struct<city:string,state:string>,name:string>

Jino Michel Aque
  • you are very close, and don't really need explode. Just replace the temp expression with `val temp = itemsExample.select(lit(col_name).as("Item"), col(col_name).as("Value"))` – abiratsis May 08 '20 at 01:16
  • It might be easier to just change the JSON to an `array` instead of trying to do `explode` afterward, especially if your JSON object is multi-nested. – moon May 08 '20 at 01:17
  • Didn't that work @jqc? Did you try to replace the temp variable? – abiratsis May 09 '20 at 17:03
  • @AlexandrosBiratsis Yes, your solution worked too. – Jino Michel Aque May 10 '20 at 03:59
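
For reference, here is a minimal sketch of the approach suggested in the comments above (selecting each root column directly, with no explode), assuming the itemsExample DataFrame built earlier:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Build one two-column DataFrame per root key, then union them all.
val arrayOfDFs: Array[DataFrame] = itemsExample.columns.map { col_name =>
  itemsExample.select(
    lit(col_name).as("Item"),   // the column name becomes the key
    col(col_name).as("Value"))  // the struct itself becomes the value
}

val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)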

1 Answer


You can use the `stack` function.

Example:

itemsExample
  .selectExpr("""stack(2, 'ITEM1512', ITEM1512, 'ITEM1518', ITEM1518) as (Item, Values)""")
  .show(false)
//+--------+-----------------------+
//|Item    |Values                 |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
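
For context (my gloss, not from the answer itself): stack is a table-generating function; stack(n, expr1, expr2, ...) lays its arguments out into n rows, so each row here consumes one (label, column) pair. A tiny standalone illustration:

// stack(2, ...) turns four literal arguments into two (key, value) rows.
spark.range(1)
  .selectExpr("stack(2, 'a', 1, 'b', 2) as (key, value)")
  .show()
//+---+-----+
//|key|value|
//+---+-----+
//|  a|    1|
//|  b|    2|
//+---+-----+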

UPDATE:

Dynamic Stack query:

val stack = itemsExample.columns
  .map(x => s"'${x}',${x}")
  .mkString(s"stack(${itemsExample.columns.size},", ",", ") as (Item,Values)")
//stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)

itemsExample.selectExpr(stack).show(false)
//+--------+-----------------------+
//|Item    |Values                 |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
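
One caveat worth noting (my addition, not from the original answer): if a root key contains characters that are illegal in an unquoted SQL identifier (dots, hyphens, spaces), the generated expression will not parse in selectExpr. Wrapping each column reference in backticks guards against that:

// Same dynamic stack expression, with each identifier backtick-quoted.
val stackSafe = itemsExample.columns
  .map(x => s"'${x}',`${x}`")
  .mkString(s"stack(${itemsExample.columns.size},", ",", ") as (Item,Values)")
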
notNull
  • Hello! I notice that although this solution is OK, I need to merge DataFrames with different schemas, so I have to do something similar to what @AlexandrosBiratsis said above. But I run into a problem with the union. I asked another question if you would like to see it: https://stackoverflow.com/questions/61739858/merge-dataframes-with-differents-schemas-scala-spark – Jino Michel Aque May 11 '20 at 21:41
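
As a pointer on that follow-up (an assumption on my part, not part of this thread): in Spark 3.1+, Dataset.unionByName takes an allowMissingColumns flag that null-fills columns absent from one side, which is one way to union DataFrames whose schemas differ. A minimal sketch with hypothetical DataFrames:

import spark.implicits._  // for toDF on local Seqs

val df1 = Seq(("ITEM1512", "Columbus", "Ohio")).toDF("Item", "city", "state")
val df2 = Seq(("ITEM1518", "Working")).toDF("Item", "city")  // no "state" column

// Rows coming from df2 get null in the missing "state" column.
val merged = df1.unionByName(df2, allowMissingColumns = true)
merged.show(false)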