
My prototype (written in R with the dplyr and tidyr packages) is hitting a wall in terms of computational complexity, even on my powerful workstation. Therefore, I want to port the code to Spark using Scala.

I looked up all transformations, actions, functions (Spark SQL) and column operations (also Spark SQL) and found equivalents for everything except R's tidyr::spread() function.

df %>% tidyr::spread(key = COL_KEY, value = COL_VAL) basically spreads key-value pairs across multiple columns. For example, the table

COL_KEY | COL_VAL
-----------------
A       | 1
B       | 1
A       | 2

will be transformed into

A       | B
------------
1       | 0
0       | 1
2       | 0
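
For reference, the same toy input could be created on the Spark side roughly like this (just a sketch; it assumes the spark-shell's SparkSession named spark):

import spark.implicits._

val df = Seq(
  ("A", 1),
  ("B", 1),
  ("A", 2)
).toDF("COL_KEY", "COL_VAL")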

In case there is no out-of-the-box solution available: could you point me in the right direction? Maybe a user-defined function?

I'm free to choose the Spark (and Scala) version, so I'd go for the latest (2.0.0).

Thanks!

Boern
  • http://stackoverflow.com/questions/31927500/is-it-possible-to-use-a-java-8-style-method-references-in-scala – Shankar Sep 12 '16 at 14:10

1 Answer


Out-of-the-box but requires a shuffle:

import org.apache.spark.sql.functions.{first, monotonically_increasing_id}

df
  // A dummy unique key to perform grouping
  .withColumn("_id", monotonically_increasing_id)
  .groupBy("_id")
  .pivot("COL_KEY")
  .agg(first("COL_VAL"))
  .drop("_id")

// +----+----+
// |   A|   B|
// +----+----+
// |   1|null|
// |null|   1|
// |   2|null|
// +----+----+

You can optionally follow it with .na.fill(0).
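
For instance, chaining the fill onto the pipeline above should give the zero-filled layout the question asks for (a sketch; row order after the shuffle is not guaranteed):

import org.apache.spark.sql.functions.{first, monotonically_increasing_id}

df
  .withColumn("_id", monotonically_increasing_id)
  .groupBy("_id")
  .pivot("COL_KEY")
  .agg(first("COL_VAL"))
  .drop("_id")
  .na.fill(0)
  .show()

// +---+---+
// |  A|  B|
// +---+---+
// |  1|  0|
// |  0|  1|
// |  2|  0|
// +---+---+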

Manually without shuffle:

// Required imports: lit/when plus the implicits for $ and .as[String]
import org.apache.spark.sql.functions.{lit, when}
import spark.implicits._ // spark: the SparkSession, as in spark-shell

// Find distinct keys
val keys = df.select($"COL_KEY").as[String].distinct.collect.sorted

// For each key, a column holding COL_VAL where the key matches and 0 otherwise
val exprs = keys.map(key =>
  when($"COL_KEY" === key, $"COL_VAL").otherwise(lit(0)).alias(key)
)

df.select(exprs: _*)

// +---+---+
// |  A|  B|
// +---+---+
// |  1|  0|
// |  0|  1|
// |  2|  0|
// +---+---+
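
If you need this more than once, the manual variant could be wrapped in a small helper like the one below; spreadKeyValue and its signature are my own, not part of Spark's API, and it assumes a string key column with few distinct values (they are collected to the driver):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical helper wrapping the manual (no-shuffle) variant above.
def spreadKeyValue(df: DataFrame, keyCol: String, valCol: String): DataFrame = {
  // Distinct keys end up on the driver, so this assumes there are few of them
  val keys = df.select(keyCol).distinct.collect.map(_.getString(0)).sorted
  val exprs: Array[Column] = keys.map(key =>
    when(col(keyCol) === key, col(valCol)).otherwise(lit(0)).alias(key)
  )
  df.select(exprs: _*)
}

// e.g. spreadKeyValue(df, "COL_KEY", "COL_VAL")
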
zero323
  • In case somebody tries this out: Don't forget to import lit and when, using `import org.apache.spark.sql.functions.{lit, when}` (second option) – Boern Nov 07 '16 at 09:08