
I have a data frame and I would like to use Scala to explode rows into multiple rows using the values in multiple columns. Ideally I am looking to replicate the behavior of the R function melt().

All the columns contain Strings.

Example: I want to transform this data frame..

df.show
+----+-----------+-----------+-----+----+
|col1|       col2|       col3| res1|res2|
+----+-----------+-----------+-----+----+
|   a|   baseline|equivalence| TRUE| 0.1|
|   a|experiment1|equivalence|FALSE|0.01|
|   b|   baseline|equivalence| TRUE| 0.2|
|   b|experiment1|equivalence|FALSE|0.02|
+----+-----------+-----------+-----+----+

...into this data frame:

+----+-----------+-----------+----+-----+
|col1|       col2|       col3| key|value|
+----+-----------+-----------+----+-----+
|   a|   baseline|equivalence|res1| TRUE|
|   a|experiment1|equivalence|res1|FALSE|
|   b|   baseline|equivalence|res1| TRUE|
|   b|experiment1|equivalence|res1|FALSE|
|   a|   baseline|equivalence|res2|  0.1|
|   a|experiment1|equivalence|res2| 0.01|
|   b|   baseline|equivalence|res2|  0.2|
|   b|experiment1|equivalence|res2| 0.02|
+----+-----------+-----------+----+-----+
  • Is there a built-in function in Scala which applies to datasets or data frames to do this?
  • If not, would it be relatively simple to implement this? How would it be done at a high level?
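To make the second bullet concrete: at a high level, melting means emitting one output row per (input row, value column) pair, carrying the id columns along and adding a (key, value) pair. A minimal sketch on plain Scala collections (rows modeled as `Map[String, String]` purely for illustration; this is not Spark API):

```scala
// Melt rows represented as Maps: keep idCols as-is, and for each
// value column emit one row with ("key" -> colName, "value" -> cell).
def melt(rows: Seq[Map[String, String]],
         idCols: Seq[String],
         valueCols: Seq[String]): Seq[Map[String, String]] =
  for {
    row <- rows
    v   <- valueCols
  } yield idCols.map(c => c -> row(c)).toMap +
          ("key" -> v) + ("value" -> row(v))

val rows = Seq(
  Map("col1" -> "a", "col2" -> "baseline",    "col3" -> "equivalence",
      "res1" -> "TRUE",  "res2" -> "0.1"),
  Map("col1" -> "a", "col2" -> "experiment1", "col3" -> "equivalence",
      "res1" -> "FALSE", "res2" -> "0.01")
)

// 2 input rows x 2 value columns -> 4 melted rows
val melted = melt(rows, Seq("col1", "col2", "col3"), Seq("res1", "res2"))
```

The same idea in Spark would be expressed with `explode` over an array of structs, or with the SQL `stack` function shown in the answer below; the sketch only illustrates the shape of the transformation.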

Note: I have found the class UnpivotOp from SMV, which does exactly what I want (https://github.com/TresAmigosSD/SMV/blob/master/src/main/scala/org/tresamigos/smv/UnpivotOp.scala).

Unfortunately, the class is private, so I cannot do something like this:

import org.tresamigos.smv.UnpivotOp
val melter = new UnpivotOp(df, Seq("res1", "res2"))
val melted_df = melter.unpivot()

Does anyone know if there is a way to access the class org.tresamigos.smv.UnpivotOp via some other class or static method of SMV?

Thanks!

    Hello @user6910411 : not really a dup: while the answer for the question is similar to that provided in "unpivot in spark-sql/pyspark" it is not the same. The answer provided in "unpivot in spark-sql/pyspark" is for pyspark, not scala. In addition, the question in "unpivot in spark-sql/pyspark" does not reference the R function `melt()`, it took me some time to find "unpivot in spark-sql/pyspark" and realize it could be re-used to resolve the particular problem asked here. Having a reference to `melt()` will help R users looking to reproduce the behavior of `melt()` with Scala. – Raphvanns Feb 21 '18 at 19:40

1 Answer


Thanks to Andrew Ray's answer to "unpivot in spark-sql/pyspark", this did the trick:

df.select($"col1",
          $"col2",
          $"col3",
          expr("stack(2, 'res1', res1, 'res2', res2) as (key, value)"))

Or, if the select expressions should be passed as strings (handy for df %>% sparklyr::invoke("")):

df.selectExpr("col1", 
              "col2", 
              "col3",
              "stack(2, 'res1', res1, 'res2', res2) as (key, value)")
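With many value columns, writing the `stack(...)` expression by hand gets tedious. It can be generated from the column list instead; the helper name `stackExpr` below is my own, not a Spark API:

```scala
// Build a SQL stack(...) expression for arbitrary value columns, e.g.
// stackExpr(Seq("res1", "res2")) produces
// "stack(2, 'res1', res1, 'res2', res2) as (key, value)"
def stackExpr(valueCols: Seq[String]): String = {
  val pairs = valueCols.map(c => s"'$c', $c").mkString(", ")
  s"stack(${valueCols.length}, $pairs) as (key, value)"
}
```

It would then be used as `df.selectExpr("col1", "col2", "col3", stackExpr(Seq("res1", "res2")))`. Note that `stack` requires all stacked columns to share a compatible type, which holds here since every column is a String.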