
I'm working with a dataframe df whose schema looks like:

   root
    |-- array(data1, data2, data3, data4): array (nullable = false)
    |    |-- element: array (containsNull = true)
    |    |    |-- element: struct (containsNull = true)
    |    |    |    |-- k: struct (nullable = false)
    |    |    |    |    |-- v: string (nullable = true)
    |    |    |    |    |-- t: string (nullable = false)
    |    |    |    |-- resourcename: string (nullable = true)
    |    |    |    |-- criticity: string (nullable = true)
    |    |    |    |-- v: string (nullable = true)
    |    |    |    |-- vn: double (nullable = true)

As shown by df.printSchema(), the array column contains four arrays "data1", "data2", "data3" and "data4", all with the same schema and data types. I got this dataframe after:

   df.withcolumn("Column1",array(col("data1"),col("data2")
   ,col("data3"),col("data4"))

I want to get a new dataframe that contains all the elements of "data1", "data2", "data3" and "data4" in a single array. The new schema must be:

      |-- data: array (nullable = true)
      |    |-- element: struct (containsNull = true)
      |    |    |-- criticity: string (nullable = true)
      |    |    |-- k: struct (nullable = true)
      |    |    |    |-- t: string (nullable = true)
      |    |    |    |-- v: string (nullable = true)
      |    |    |-- resourcename: string (nullable = true)
      |    |    |-- v: string (nullable = true)
      |    |    |-- vn: double (nullable = true) 
  • Possible duplicate of [Querying Spark SQL DataFrame with complex types](https://stackoverflow.com/questions/28332494/querying-spark-sql-dataframe-with-complex-types) – Aaron Makubuya Oct 04 '18 at 17:21

2 Answers


I recommend using Datasets. Start by defining three case classes:

case class MyClass1(t: String, v: String)
case class MyClass2(criticity: String, k: MyClass1, resourcename: String, v: String, vn: Double)
case class MyList(data: Seq[Seq[MyClass2]])

Then create your Dataset like this:

import spark.implicits._

val myDS = df.select(array($"data1", $"data2", $"data3", $"data4").as("data")).as[MyList]
// note that myDS.data has the type: list of lists of MyClass2

// Datasets allow us to flatten the nested lists
val myDSFlatten = myDS.flatMap(_.data)

"myDSFlatten" should have the desired schema.

Note that I used Scala.


If you use Spark >= 2.4, you can easily do this with the new flatten function:

flatten(arrayOfArrays) - Transforms an array of arrays into a single array.
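For example, a minimal sketch assuming the four array columns sit directly on df as in the question:

import org.apache.spark.sql.functions.{array, col, flatten}

// Wrap the four array columns in an outer array, then flatten
// the resulting array of arrays into a single array of structs.
val result = df.select(
  flatten(array(col("data1"), col("data2"), col("data3"), col("data4"))).as("data")
)
result.printSchema()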
