
I need to convert the following dataframe:

╔══════╦════════╦════════╦════════╗
║ Year ║  ColA  ║  ColB  ║  ColC  ║
╠══════╬════════╬════════╬════════╣
║ 2017 ║      1 ║      2 ║      3 ║
║ 2018 ║      4 ║      5 ║      6 ║
║ 2019 ║      7 ║      8 ║      9 ║
╚══════╩════════╩════════╩════════╝

Into this:

╔══════╦════════╦═══════╗
║ Year ║ColName ║ Value ║
╠══════╬════════╬═══════╣
║ 2017 ║  ColA  ║     1 ║
║ 2017 ║  ColB  ║     2 ║
║ 2017 ║  ColC  ║     3 ║
║ 2018 ║  ColA  ║     4 ║
║ 2018 ║  ColB  ║     5 ║
║ 2018 ║  ColC  ║     6 ║
║ 2019 ║  ColA  ║     7 ║
║ 2019 ║  ColB  ║     8 ║
║ 2019 ║  ColC  ║     9 ║
╚══════╩════════╩═══════╝

This needs to support any number of columns besides the first "Year" one; there could be one or many. It should also be a generic solution, meaning it should not hard-code column names anywhere but should read the column names directly from the original dataframe.

I'm using Databricks with a notebook written in Scala. Very new to both Spark and Scala.

UPDATE

I've found this solution in Python that works well, but I'm having a hard time converting it to Scala.

from pyspark.sql import functions as F

def columnsToRows(df, by):
  # Filter dtypes and split into column names and type description.
  # Only get columns not in "by".
  cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))

  # Create and explode an array of (column_name, column_value) structs
  kvs = F.explode(F.array([
    F.struct(F.lit(c.strip()).alias("ColName"), F.col(c).alias("Value")) for c in cols
  ])).alias("kvs")

  return df.select(by + [kvs]).select(by + ["kvs.ColName", "kvs.Value"])
empz
  • Possible duplicate of [Transpose column to row with Spark](https://stackoverflow.com/questions/37864222/transpose-column-to-row-with-spark) – Raphael Roth Aug 15 '19 at 15:24

2 Answers


You can use `stack` to transpose the data:

import org.apache.spark.sql.functions._

// Columns to keep fixed; add any others here (e.g. "FixedColumn") to stop
// them from being unpivoted into rows.
val fixedColumns = Seq("Year")

val cols = df.columns
  .filterNot(fixedColumns.contains)
  .map(c => s"'$c', $c")

val exp = cols.mkString(s"stack(${cols.length}, ", ", ", ") as (ColName, Value)")

df.select((fixedColumns.map(col) :+ expr(exp)): _*)

Output:

  +----+-------+-----+
  |Year|ColName|Value|
  +----+-------+-----+
  |2017|ColA   |1    |
  |2017|ColB   |2    |
  |2017|ColC   |3    |
  |2018|ColA   |4    |
  |2018|ColB   |5    |
  |2018|ColC   |6    |
  |2019|ColA   |7    |
  |2019|ColB   |8    |
  |2019|ColC   |9    |
  +----+-------+-----+
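
For anyone new to Spark who wants to try this out, here is a minimal sketch (assuming a Databricks Scala notebook, where `spark` and its implicits are available) that rebuilds the sample dataframe from the question so the snippet above can be run against it:

import spark.implicits._

// Recreate the sample dataframe from the question.
val df = Seq(
  (2017, 1, 2, 3),
  (2018, 4, 5, 6),
  (2019, 7, 8, 9)
).toDF("Year", "ColA", "ColB", "ColC")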
koiralo
  • Uhm, okay, but I need a generic solution as mentioned that would work for any number of columns (the first one would always be "Year") and doesn't use hardcoded column names as these could just be anything. – empz Aug 15 '19 at 14:19
  • Great, thanks. One more thing. Suppose I have a dataframe that, besides Year, has some extra columns that don't need to be "transformed" into rows? Is there a way to have a list of columns that I want to keep as columns? Something like using a `Seq("Year", "Another Fixed Column", "Another one")` to avoid transforming these into rows? – empz Aug 15 '19 at 15:04
  • I've edited your answer to match what I need. Thanks! – empz Aug 15 '19 at 15:15
  • Thanks for updating, I hope the answer helped you! – koiralo Aug 16 '19 at 09:34
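
Regarding the comment about keeping several fixed columns: with the edited answer above, any name added to `fixedColumns` is selected as-is and skipped by the `stack` expression. A quick sketch with a hypothetical extra "Country" column (the column name and data are made up for illustration):

import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._

val df2 = Seq(
  (2017, "US", 1, 2),
  (2018, "US", 4, 5)
).toDF("Year", "Country", "ColA", "ColB")

val fixed = Seq("Year", "Country")
val pairs = df2.columns.filterNot(fixed.contains).map(c => s"'$c', $c")
val stackExpr = pairs.mkString(s"stack(${pairs.length}, ", ", ", ") as (ColName, Value)")

// "Year" and "Country" survive unchanged; ColA/ColB become rows.
df2.select((fixed.map(col) :+ expr(stackExpr)): _*)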

Your Python code translates to Scala like this:

import org.apache.spark.sql.functions._
import spark.implicits._

val colsToKeep = Seq("Year").map(col)
val colsToTransform = Seq("ColA", "ColB", "ColC")

df.select((colsToKeep :+
    explode(
      array(colsToTransform.map(c => struct(lit(c).alias("ColName"), col(c).alias("Value"))): _*)
    ).as("kvs")): _*)
  .select((colsToKeep :+ $"kvs.ColName" :+ $"kvs.Value"): _*)
  .show()
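
To meet the question's "no hard-coded names" requirement, the transform list can instead be derived from `df.columns`, mirroring the Python `columnsToRows` above; a minimal sketch:

import org.apache.spark.sql.functions._
import spark.implicits._

// Everything except the "by" columns gets unpivoted into rows.
val by = Seq("Year")
val colsToKeep = by.map(col)
val colsToTransform = df.columns.filterNot(by.contains)

df.select((colsToKeep :+
    explode(
      array(colsToTransform.map(c => struct(lit(c).alias("ColName"), col(c).alias("Value"))): _*)
    ).as("kvs")): _*)
  .select((colsToKeep :+ $"kvs.ColName" :+ $"kvs.Value"): _*)

Note that, as with the Python version, all transformed columns end up in a single Value column, so they need a compatible type; Spark will refuse to build the array of structs otherwise.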
Raphael Roth