
I'm working with a dataframe df whose schema looks like:

   root
    |-- array(data1, data2, data3, data4): array (nullable = false)
    |    |-- element: array (containsNull = true)
    |    |    |-- element: struct (containsNull = true)
    |    |    |    |-- k: struct (nullable = false)
    |    |    |    |    |-- v: string (nullable = true)
    |    |    |    |    |-- t: string (nullable = false)
    |    |    |    |-- resourcename: string (nullable = true)
    |    |    |    |-- criticity: string (nullable = true)
    |    |    |    |-- v: string (nullable = true)
    |    |    |    |-- vn: double (nullable = true)

As shown by df.printSchema(), the array column contains four arrays "data1", "data2", "data3" and "data4", all with the same schema and data types. I got this dataframe after:

   df.withcolumn("Column1",array(col("data1"),col("data2")
   ,col("data3"),col("data4"))

I want to get a new dataframe that contains all the elements of "data1", "data2", "data3" and "data4" in a single array. The new schema must be:

      |-- data: array (nullable = true)
      |    |-- element: struct (containsNull = true)
      |    |    |-- criticity: string (nullable = true)
      |    |    |-- k: struct (nullable = true)
      |    |    |    |-- t: string (nullable = true)
      |    |    |    |-- v: string (nullable = true)
      |    |    |-- resourcename: string (nullable = true)
      |    |    |-- v: string (nullable = true)
      |    |    |-- vn: double (nullable = true) 
  • Possible duplicate of [Querying Spark SQL DataFrame with complex types](https://stackoverflow.com/questions/28332494/querying-spark-sql-dataframe-with-complex-types) – Aaron Makubuya Oct 04 '18 at 17:21

2 Answers


I recommend using Datasets. Start by defining three case classes:

case class MyClass1(t: String, v: String)
case class MyClass2(criticity: String, k: MyClass1, resourcename: String, v: String, vn: Double)
case class MyList(data: Seq[Seq[MyClass2]])

Then create your Dataset like this:

import spark.implicits._

val myDS = df.select(array($"data1", $"data2", $"data3", $"data4").as("data")).as[MyList]
// note that myDS.data has the type: list of lists of MyClass2

// Datasets allow us to flatten the nested lists
val myDSFlatten = myDS.flatMap(_.data)

"myDSFlatten" should have the desired schema.

Note that I used Scala.


If you use Spark >= 2.4, you can easily do this with the new flatten function:

flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
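For example, a minimal sketch assuming the four array columns sit directly on df as in the question:

import org.apache.spark.sql.functions.{array, col, flatten}

// Wrap the four array columns in an outer array, then flatten
// the resulting array of arrays into a single array of structs.
val result = df.select(
  flatten(array(col("data1"), col("data2"), col("data3"), col("data4"))).as("data")
)
result.printSchema()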
