
We use melt and dcast in R to convert data from wide to long format and back. See http://seananderson.ca/2013/10/19/reshape.html for more details.

Either Scala or SparkR is fine.

I've gone through that blog, the Scala functions, and the R API, but I don't see a function that does a similar job.

Is there any equivalent function in Spark? If not, is there any other way to do it in Spark?

sag
  • Doesn't seem like it. If you can fit your data into memory, use `as.data.frame()` to convert the Spark DataFrame to a native data.frame, reshape that, and write it back to Spark (see the sketch after these comments). – Thomas Apr 07 '16 at 12:30
  • Because there is none. You'll need to write it yourself. – eliasah Apr 07 '16 at 16:35
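
For completeness, here is a minimal Scala sketch of the collect-and-reshape-locally idea from the comments (my own illustration, not code from the thread; it assumes a Spark 1.6-style sqlContext and that the whole DataFrame fits in driver memory):

    import org.apache.spark.sql.{DataFrame, Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Collect to the driver, melt locally, then rebuild a DataFrame.
    // Only viable when df fits in driver memory.
    def meltLocally(df: DataFrame, idCols: Seq[String], sqlContext: SQLContext): DataFrame = {
      val valueCols = df.columns.filterNot(idCols.contains)
      val meltedRows = df.collect().flatMap { row =>
        val ids = idCols.map(c => row.getAs[Any](c))
        valueCols.map(c => Row.fromSeq(ids :+ c :+ String.valueOf(row.getAs[Any](c))))
      }
      val schema = StructType(
        idCols.map(c => df.schema(c)) ++
          Seq(StructField("variable", StringType), StructField("value", StringType)))
      sqlContext.createDataFrame(sqlContext.sparkContext.parallelize(meltedRows), schema)
    }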

3 Answers


Reshaping Data with Pivot in Spark describes Spark's support for reshaping with pivot. As I understand it, melt is roughly the reverse of pivot, also called unpivot. I'm relatively new to Spark, but with what I know I tried to implement a melt operation:

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Keeps `columns` as id columns and turns every other column into
    // (variable, value) rows. Written against Spark 1.6, where
    // DataFrame.map returns an RDD and unionAll is still current.
    def melt(df: DataFrame, columns: List[String]): DataFrame = {
      val restOfTheColumns = df.columns.filterNot(columns.contains(_))
      val baseDF = df.select(columns.head, columns.tail: _*)
      val newStructure = StructType(
        baseDF.schema.fields ++
          List(StructField("variable", StringType, true),
               StructField("value", StringType, true)))
      var newdf = sqlContext.createDataFrame(sqlContext.sparkContext.emptyRDD[Row], newStructure)

      for (variableCol <- restOfTheColumns) {
        // Values of the current column, in the same row order as baseDF
        val colValues = df.select(variableCol).map(r => r(0).toString)
        // zip is safe here: both RDDs come from df with no shuffle in between,
        // so they have identical partitioning
        val colRdd = baseDF.rdd.zip(colValues)
          .map { case (row, value) => Row.fromSeq(row.toSeq :+ variableCol :+ value) }
        val colDF = sqlContext.createDataFrame(colRdd, newStructure)
        newdf = newdf.unionAll(colDF)
      }
      newdf
    }

It does the job, but I'm not very sure about its efficiency. Given input data such as:

+-----+---+---+----------+------+
| name|sex|age|    street|weight|
+-----+---+---+----------+------+
|Alice|  f| 34| somewhere|    70|
|  Bob|  m| 63|   nowhere|   -70|
|Alice|  f|612|nextstreet|    23|
|  Bob|  m|612|      moon|     8|
+-----+---+---+----------+------+

it can be used as:

melt(df, List("name", "sex"))

The result is as follows:

+-----+---+--------+----------+
| name|sex|variable|     value|
+-----+---+--------+----------+
|Alice|  f|     age|        34|
|  Bob|  m|     age|        63|
|Alice|  f|     age|       612|
|  Bob|  m|     age|       612|
|Alice|  f|  street| somewhere|
|  Bob|  m|  street|   nowhere|
|Alice|  f|  street|nextstreet|
|  Bob|  m|  street|      moon|
|Alice|  f|  weight|        70|
|  Bob|  m|  weight|       -70|
|Alice|  f|  weight|        23|
|  Bob|  m|  weight|         8|
+-----+---+--------+----------+

I hope it is useful, and I'd appreciate your comments if there is room for improvement.

NehaM

Here's a spark.ml.Transformer that uses only Dataset manipulations (no RDD operations):

import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, functions}
import org.apache.spark.sql.types.StructType

case class Melt(meltColumns: String*) extends Transformer {

  // Turns each non-melt column into (variable, value) rows,
  // keeping the melt columns as identifiers
  override def transform(in: Dataset[_]): DataFrame = {
    val nonMeltColumns = in.columns.filterNot(meltColumns.contains)
    val newDS = in
      .select(nonMeltColumns.head, meltColumns: _*)
      .withColumn("variable", functions.lit(nonMeltColumns.head))
      .withColumnRenamed(nonMeltColumns.head, "value")

    nonMeltColumns.tail
      .foldLeft(newDS) { case (acc, col) =>
        in
          .select(col, meltColumns: _*)
          .withColumn("variable", functions.lit(col))
          .withColumnRenamed(col, "value")
          .union(acc)
      }
      .select(meltColumns.head, meltColumns.tail ++ List("variable", "value"): _*)
  }

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)

  // Deliberately left unimplemented; supply a real schema
  // if you need this Transformer inside a Pipeline
  @DeveloperApi
  override def transformSchema(schema: StructType): StructType = ???

  override val uid: String = Identifiable.randomUID("Melt")
}

Here's a test that uses it:

"spark" should "melt a dataset" in {
    import spark.implicits._
    val schema = StructType(
      List(StructField("Melt1",StringType),StructField("Melt2",StringType)) ++
      Range(3,10).map{ i => StructField("name_"+i,DoubleType)}.toList)

    val ds = Range(1,11)
      .map{ i => Row("a" :: "b" :: Range(3,10).map{ j => Math.random() }.toList :_ *)}
      .|>{ rows => spark.sparkContext.parallelize(rows) }
      .|>{ rdd => spark.createDataFrame(rdd,schema) }

    val newDF = ds.transform{ df =>
      Melt("Melt1","Melt2").transform(df) }

    assert(newDF.count() === 70)
  }

`.|>` is the scalaz pipe operator.
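
If you'd rather not pull in scalaz just for that, a minimal stand-in is easy to write (my own sketch, not scalaz's actual definition):

    // Minimal "thrush" operator: x |> f  is  f(x)
    object PipeOps {
      implicit class Pipe[A](private val self: A) extends AnyVal {
        def |>[B](f: A => B): B = f(self)
      }
    }

With `import PipeOps._` in scope, the test above compiles without the scalaz dependency.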

Thomas Luechtefeld

Spark's DataFrame has an explode method that provides R's melt functionality. An example that works in Spark 1.6.1:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// input df has columns (anyDim, n1, n2)
case class MNV(measureName: String, measureValue: Int)

val dfExploded = df.explode(col("n1"), col("n2")) {
  // each input row yields one output row per measure column
  case Row(n1: Int, n2: Int) => Array(MNV("n1", n1), MNV("n2", n2))
}
// dfExploded has columns (anyDim, n1, n2, measureName, measureValue)
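
To finish the reshape you would typically drop the original measure columns; a small follow-up sketch using the names from the snippet above:

    // Keep only (anyDim, measureName, measureValue); chained single-column
    // drops because Spark 1.6 has no varargs overload of drop
    val melted = dfExploded.drop("n1").drop("n2")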
Jussi Kujala