Context

I have a data frame containing (what I think are) couples of (String, String).

It looks like this:

> df.show
| Col1 | Col2    |
| A    | [k1, v1]|
| A    | [k2, v2]|

> df.printSchema
|-- Col1: string (nullable = true)
|-- Col2: struct (nullable = true)
|    |-- _1: string (nullable = true)
|    |-- _2: string (nullable = true)

Col2 originally held a Map[String, String]. I called toList on it and then explode() to get one row per key/value pair of the original Map.


Question

I would like to split Col2 into 2 columns and obtain this dataframe:

| Col1 | key    | value |
| A    | k1     | v1    |
| A    | k2     | v2    |

Does anyone know how to do this?

Alternatively, does anyone know how to explode and split a map into multiple rows (one per mapping) and 2 columns (one for the key, one for the value)?


Things I have tried / Error

I tried the pattern that usually works for a (String, String) column, but it fails here:

df.select("Col1", "Col2").
   map(r =>(r(0).asInstanceOf[String],
            r(1).asInstanceOf[(String, String)](0),
            r(1).asInstanceOf[(String, String)](1)
           )
       )

Caused by: java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2

==> I guess the runtime type of Col2 is org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema, but I could not find any Spark / Scala documentation for it.

And even if the cast worked, indexing with (0) and (1) is not the right way to access the elements of a tuple (that would be ._1 and ._2).
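For reference, the exception is consistent with how Spark hands a struct column to a typed map: as a Row, not as a Scala tuple. The sketch below reads the struct as a Row and accesses its fields by name. The sample data, the object name, and the local SparkSession are assumptions made for illustration; only the column names Col1/Col2 and the struct fields _1/_2 come from the question.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object StructAsRow {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("struct-as-row")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data shaped like the question's frame:
    // Col2 is a struct with string fields _1 and _2.
    val df = Seq(("A", ("k1", "v1")), ("A", ("k2", "v2"))).toDF("Col1", "Col2")

    val split = df.map { r =>
      // A struct value arrives as a Row, so read it as a Row
      // instead of casting it to (String, String).
      val kv = r.getAs[Row]("Col2")
      (r.getAs[String]("Col1"), kv.getAs[String]("_1"), kv.getAs[String]("_2"))
    }.toDF("Col1", "key", "value")

    split.show()
    spark.stop()
  }
}
```

Accessing fields by name also sidesteps the positional indexing concern, at the cost of leaving the untyped column API for a typed map.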

Thanks!

Raphvanns

2 Answers

You can use select to project each field of the struct into its own column:

df.select($"Col1", $"Col2._1".as("key"), $"Col2._2".as("value"))
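A self-contained sketch of the same select, for anyone who wants to try it outside an existing session. The sample rows, the object name, and the local SparkSession are assumptions; the select expression itself is the one above.

```scala
import org.apache.spark.sql.SparkSession

object UnpackStruct {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unpack-struct")
      .getOrCreate()
    import spark.implicits._ // enables toDF and the $"..." column syntax

    // Hypothetical data: a tuple in the second position becomes
    // a struct column with fields _1 and _2.
    val df = Seq(("A", ("k1", "v1")), ("A", ("k2", "v2"))).toDF("Col1", "Col2")

    // Project each struct field into its own top-level column.
    val result = df.select($"Col1", $"Col2._1".as("key"), $"Col2._2".as("value"))

    result.printSchema()
    result.show()
    spark.stop()
  }
}
```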
Thang Nguyen

You can also add the two struct fields as new columns with withColumn:

df.withColumn("key", $"Col2._1")
  .withColumn("value", $"Col2._2")
  .drop("Col2")  // drop the original struct so only Col1, key and value remain
Timtech
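On the alternative raised in the question (going straight from the map to one row per mapping), a possible sketch is to call explode on the map column itself and skip the toList step. When explode is applied to a map column, Spark emits two columns named key and value by default. The sample data, the object name, and the local SparkSession are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object ExplodeMap {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-map")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: Col2 is still the original Map[String, String].
    val df = Seq(("A", Map("k1" -> "v1", "k2" -> "v2"))).toDF("Col1", "Col2")

    // explode on a map column yields one row per entry,
    // with the entry split into columns "key" and "value".
    val result = df.select($"Col1", explode($"Col2"))

    result.printSchema()
    result.show()
    spark.stop()
  }
}
```

This avoids the intermediate struct entirely, so no unpacking step is needed afterwards.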