
I am looking for an efficient way to take the headers of several columns, and unpivot them into a single column. Here is an example:

I start with a table that looks like this. Three columns, one header for each of them.

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| null     | null     | null     |
| null     | null     | null     |
| null     | null     | null     |

What I want to do is this:

| Some Name | Unique Name | Unique Name | Unique Name |
|-----------|-------------|-------------|-------------|
| Header 1  | null        | null        | null        |
| Header 2  | null        | null        | null        |
| Header 3  | null        | null        | null        |

I'm essentially attempting to transpose the header names so that each becomes a value in a single new column, which gets a new header name; each row's values then become part of new columns, which also get new header names. I understand how to use the .pivot() function to create new headers based on the values of a column, but I've had trouble doing the reverse.

My research suggests that pandas has .melt(), which may or may not be the ideal solution to this problem, but as a new Scala developer using Spark for the first time, I could use some advice on how best to approach this. My apologies if this is simpler than I imagine it to be!
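For reference, this kind of unpivot can also be written directly with Spark SQL's built-in stack expression. A minimal sketch, assuming a DataFrame df holding the three header columns from the example above (all of a compatible type); `Some Name` and `Value` here are placeholder output column names:

// Unpivot three columns into (name, value) pairs with stack.
// Each pair is a literal column name followed by the column itself.
val unpivoted = df.selectExpr(
  "stack(3, " +
    "'Header 1', `Header 1`, " +
    "'Header 2', `Header 2`, " +
    "'Header 3', `Header 3`" +
  ") as (`Some Name`, `Value`)")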

Thank you for all of your assistance.

  • What are you trying to achieve with this "pivoting"? If I understand correctly, you don't want to perform any aggregations; you just want to "flip" the table? Have a look at [this question](https://stackoverflow.com/questions/40892459/spark-transpose-dataframe-without-aggregating), which seems similar to what you are trying to achieve. – Fudgy Jun 08 '21 at 09:39

1 Answer


Take a look at this approach in Scala:

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Unpivots the given value columns into two columns: one holding the
// original column name (varName), the other its value (valueName).
def melt(
    df: DataFrame,
    idVars: Array[String],
    valueVars: Array[String],
    varName: String = "variable",
    valueName: String = "value"): DataFrame = {

  // Interleave each column name (as a literal) with the column itself,
  // then build a single map column: name1 -> value1, name2 -> value2, ...
  val columns = valueVars.flatMap(c => Array(lit(c), col(c)))
  val varsAndVals = map(columns: _*)

  // explode on a map column emits one row per entry, in two columns
  // named "key" and "value", which are then renamed.
  df.select(idVars.map(col) :+ explode(varsAndVals): _*)
    .withColumnRenamed("key", varName)
    .withColumnRenamed("value", valueName)
}

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().master("local[1]").getOrCreate()

  val df = spark.createDataFrame(
    spark.sparkContext.parallelize(
      Seq(Row("a", 1, 2, null), Row("b", 3, 4, 7), Row("c", 5, 6, 9))),
    StructType(
      List(
        StructField("A", StringType),
        StructField("B", IntegerType),
        StructField("C", IntegerType),
        StructField("D", IntegerType))))

  melt(df, Array("A"), Array("B", "C", "D")).show()
}
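The key trick above is that explode over a MapType column emits one row per map entry, in two columns literally named key and value (hence the two withColumnRenamed calls). A tiny standalone illustration, using made-up column names x and y and assuming a SparkSession named spark is in scope:

import spark.implicits._
import org.apache.spark.sql.functions.{col, explode, lit, map}

// Collapse two "wide" columns into key/value rows.
Seq((1, 2)).toDF("x", "y")
  .select(explode(map(lit("x"), col("x"), lit("y"), col("y"))))
  .show()
// +---+-----+
// |key|value|
// +---+-----+
// |  x|    1|
// |  y|    2|
// +---+-----+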

Input dataframe:

+---+---+---+----+
|  A|  B|  C|   D|
+---+---+---+----+
|  a|  1|  2|null|
|  b|  3|  4|   7|
|  c|  5|  6|   9|
+---+---+---+----+

Unpivoted DF:

+---+--------+-----+
|  A|variable|value|
+---+--------+-----+
|  a|       B|    1|
|  a|       C|    2|
|  a|       D| null|
|  b|       B|    3|
|  b|       C|    4|
|  b|       D|    7|
|  c|       B|    5|
|  c|       C|    6|
|  c|       D|    9|
+---+--------+-----+
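As a side note, if you are on Spark 3.4 or later, the same result is available out of the box: Dataset has an unpivot method (also aliased as melt), so the hand-rolled helper is only needed on older versions. A sketch against the same df, assuming Spark 3.4+:

// Built-in unpivot (Spark 3.4+ only).
df.unpivot(
    Array(col("A")),                     // id columns to keep as-is
    Array(col("B"), col("C"), col("D")), // columns to unpivot
    "variable",                          // name of the key column
    "value")                             // name of the value column
  .show()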

It is adapted from this question: How to melt Spark DataFrame?

Emiliano Martinez
  • This is what I was looking for. Thank you for taking the time to show an example of the melt implementation in Scala. Very helpful to me. Marked your answer as the solution. – Daniel Volosov Jun 08 '21 at 13:27