
I need to convert the following dataframe:

╔══════╦════════╦════════╦════════╗
║ Year ║  ColA  ║  ColB  ║  ColC  ║
╠══════╬════════╬════════╬════════╣
║ 2017 ║      1 ║      2 ║      3 ║
║ 2018 ║      4 ║      5 ║      6 ║
║ 2019 ║      7 ║      8 ║      9 ║
╚══════╩════════╩════════╩════════╝

Into this:

╔══════╦════════╦═══════╗
║ Year ║ColName ║ Value ║
╠══════╬════════╬═══════╣
║ 2017 ║  ColA  ║     1 ║
║ 2017 ║  ColB  ║     2 ║
║ 2017 ║  ColC  ║     3 ║
║ 2018 ║  ColA  ║     4 ║
║ 2018 ║  ColB  ║     5 ║
║ 2018 ║  ColC  ║     6 ║
║ 2019 ║  ColA  ║     7 ║
║ 2019 ║  ColB  ║     8 ║
║ 2019 ║  ColC  ║     9 ║
╚══════╩════════╩═══════╝

This needs to support any number of columns besides the first "Year" one; there could be one or many. It should also be a generic solution, meaning it should not hard-code column names anywhere but should read the column names directly from the original dataframe.

I'm using Databricks with a notebook written in Scala. Very new to both Spark and Scala.

UPDATE

I've found this solution in Python that works well, but I'm having a hard time converting it to Scala.

from pyspark.sql import functions as F

def columnsToRows(df, by):
  # Filter dtypes and split into column names and type description.
  # Only get columns not in "by".
  cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))

  # Create and explode an array of (column_name, column_value) structs
  kvs = F.explode(F.array([
    F.struct(F.lit(c.strip()).alias("ColName"), F.col(c).alias("Value")) for c in cols
  ])).alias("kvs")

  return df.select(by + [kvs]).select(by + ["kvs.ColName", "kvs.Value"])
empz
  • Possible duplicate of [Transpose column to row with Spark](https://stackoverflow.com/questions/37864222/transpose-column-to-row-with-spark) – Raphael Roth Aug 15 '19 at 15:24

2 Answers


You can use `stack` to transpose the data:

import org.apache.spark.sql.functions._

// Columns to keep fixed; add any others here (e.g. "FixedColumn") to stop
// them from being unpivoted into rows.
val fixedColumns = Seq("Year")

val cols = df.columns
  .filterNot(fixedColumns.contains)
  .map(c => s"'$c', $c")

val exp = cols.mkString(s"stack(${cols.length}, ", ", ", ") as (ColName, Value)")

df.select((fixedColumns.map(col) :+ expr(exp)): _*)

Output:

  +----+-------+-----+
  |Year|ColName|Value|
  +----+-------+-----+
  |2017|ColA   |1    |
  |2017|ColB   |2    |
  |2017|ColC   |3    |
  |2018|ColA   |4    |
  |2018|ColB   |5    |
  |2018|ColC   |6    |
  |2019|ColA   |7    |
  |2019|ColB   |8    |
  |2019|ColC   |9    |
  +----+-------+-----+
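
For anyone new to Spark who wants to try this out, here is a minimal sketch (assuming a Databricks Scala notebook, where `spark` and its implicits are available) that rebuilds the sample dataframe from the question so the snippet above can be run against it:

import spark.implicits._

// Recreate the sample dataframe from the question.
val df = Seq(
  (2017, 1, 2, 3),
  (2018, 4, 5, 6),
  (2019, 7, 8, 9)
).toDF("Year", "ColA", "ColB", "ColC")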
koiralo
  • Uhm, okay, but I need a generic solution as mentioned that would work for any number of columns (the first one would always be "Year") and doesn't use hardcoded column names as these could just be anything. – empz Aug 15 '19 at 14:19
  • Great, thanks. One more thing. Suppose I have a dataframe that, besides Year, has some extra columns that don't need to be "transformed" into rows? Is there a way to have a list of columns that I want to keep as columns? Something like using a `Seq("Year", "Another Fixed Column", "Another one")` to avoid transforming these into rows? – empz Aug 15 '19 at 15:04
  • I've edited your answer to match what I need. Thanks! – empz Aug 15 '19 at 15:15
  • Thanks for updating, I hope the answer helped you! – koiralo Aug 16 '19 at 09:34
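
Regarding the comment about keeping several fixed columns: with the edited answer above, any name added to `fixedColumns` is selected as-is and skipped by the `stack` expression. A quick sketch with a hypothetical extra "Country" column (the column name and data are made up for illustration):

import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._

val df2 = Seq(
  (2017, "US", 1, 2),
  (2018, "US", 4, 5)
).toDF("Year", "Country", "ColA", "ColB")

val fixed = Seq("Year", "Country")
val pairs = df2.columns.filterNot(fixed.contains).map(c => s"'$c', $c")
val stackExpr = pairs.mkString(s"stack(${pairs.length}, ", ", ", ") as (ColName, Value)")

// "Year" and "Country" survive unchanged; ColA/ColB become rows.
df2.select((fixed.map(col) :+ expr(stackExpr)): _*)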

Your Python code translates to Scala like this:

import org.apache.spark.sql.functions._
import spark.implicits._

val colsToKeep = Seq("Year").map(col)
val colsToTransform = Seq("ColA", "ColB", "ColC")

df.select((colsToKeep :+
    explode(
      array(colsToTransform.map(c => struct(lit(c).alias("ColName"), col(c).alias("Value"))): _*)
    ).as("kvs")): _*)
  .select((colsToKeep :+ $"kvs.ColName" :+ $"kvs.Value"): _*)
  .show()
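
To meet the question's "no hard-coded names" requirement, the transform list can instead be derived from `df.columns`, mirroring the Python `columnsToRows` above; a minimal sketch:

import org.apache.spark.sql.functions._
import spark.implicits._

// Everything except the "by" columns gets unpivoted into rows.
val by = Seq("Year")
val colsToKeep = by.map(col)
val colsToTransform = df.columns.filterNot(by.contains)

df.select((colsToKeep :+
    explode(
      array(colsToTransform.map(c => struct(lit(c).alias("ColName"), col(c).alias("Value"))): _*)
    ).as("kvs")): _*)
  .select((colsToKeep :+ $"kvs.ColName" :+ $"kvs.Value"): _*)

Note that, as with the Python version, all transformed columns end up in a single Value column, so they need a compatible type; Spark will refuse to build the array of structs otherwise.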
Raphael Roth