
How can I add a column to a data frame from a variable value?

I know that I can create a DataFrame using .toDF(colName) and that .withColumn is the method to add a column. But when I try the following, I get a type mismatch error:

val myList = List(1,2,3)
val myArray = Array(1,2,3)

myList.toDF("myList")
  .withColumn("myArray", myArray)

Type mismatch, expected: Column, actual: Array[Int]

This compile error is on myArray within the .withColumn call. How can I convert it from an Array[Int] to a Column type?

BlueTriangles

2 Answers


The error message tells you exactly what's wrong: the second argument to withColumn() must be a Column, not a plain Scala collection. You can turn a literal value into a Column with lit(), or, for collection types like arrays, with typedLit().

Try this:

import org.apache.spark.sql.functions.typedLit
import spark.implicits._  // for .toDF on local collections (spark is your SparkSession)

val myList = List(1,2,3)
val myArray = Array(1,2,3)

myList.toDF("myList")
  .withColumn("myArray", typedLit(myArray))
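
For reference, here's roughly what you should see (a sketch, assuming a SparkSession is in scope; typedLit puts the whole array on every row):

myList.toDF("myList")
  .withColumn("myArray", typedLit(myArray))
  .show
// +------+---------+
// |myList|  myArray|
// +------+---------+
// |     1|[1, 2, 3]|
// |     2|[1, 2, 3]|
// |     3|[1, 2, 3]|
// +------+---------+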

:)

Steven Black

Not sure withColumn is what you're actually seeking. You could apply lit() to make myArray conform to the method signature (depending on your Spark version, arrays may require typedLit() as in the answer above), but the result will be the same array value repeated on every row of the DataFrame:

import org.apache.spark.sql.functions.lit
import spark.implicits._  // for .toDF and the $-column syntax (spark is your SparkSession)

myList.toDF("myList")
  .withColumn("myArray", lit(myArray))
  .show
// +------+---------+
// |myList|  myArray|
// +------+---------+
// |     1|[1, 2, 3]|
// |     2|[1, 2, 3]|
// |     3|[1, 2, 3]|
// +------+---------+

If you're trying to merge the two collections column-wise, it's a different transformation from what withColumn offers. In that case you'll need to convert each of them into a DataFrame and combine them via a join.

Now, if the elements of the two collections are row-identifying and match each other pair-wise, as in your example, and you want to join them that way, you can simply join the converted DataFrames:

myList.toDF("myList").join(
    myArray.toSeq.toDF("myArray"), $"myList" === $"myArray"
  ).show
// +------+-------+
// |myList|myArray|
// +------+-------+
// |     1|      1|
// |     2|      2|
// |     3|      3|
// +------+-------+

But if the two collections hold elements that aren't join-able and you simply want to merge them column-wise, you'll need compatible row-identifying columns in the two DataFrames to join on. If no such columns exist, one approach is to create your own rowIds, as in the following example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val df1 = List("a", "b", "c").toDF("myList")
val df2 = Array("x", "y", "z").toSeq.toDF("myArray")

// Append a sequential index to each row of df1 via zipWithIndex
val rdd1 = df1.rdd.zipWithIndex.map{
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
// Extend the schema with the new rowId column
val df1withId = spark.createDataFrame( rdd1,
  StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)

// Same for df2
val rdd2 = df2.rdd.zipWithIndex.map{
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2withId = spark.createDataFrame( rdd2,
  StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)

df1withId.join(df2withId, Seq("rowId")).show
// +-----+------+-------+
// |rowId|myList|myArray|
// +-----+------+-------+
// |    0|     a|      x|
// |    1|     b|      y|
// |    2|     c|      z|
// +-----+------+-------+
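
If you do this for more than a couple of DataFrames, the rowId logic can be factored into a small helper. This is just a sketch of the same zipWithIndex approach shown above; withRowId is a hypothetical name, and spark is assumed to be your SparkSession:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: appends a sequential rowId column to a DataFrame
def withRowId(df: DataFrame): DataFrame = {
  val rdd = df.rdd.zipWithIndex.map {
    case (row, id) => Row.fromSeq(row.toSeq :+ id)
  }
  spark.createDataFrame(rdd,
    StructType(df.schema.fields :+ StructField("rowId", LongType, false)))
}

withRowId(df1).join(withRowId(df2), Seq("rowId")).show
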
Leo C