I have two dataframes. The first one contains data like this:
+-----+-----------+-----+-----------+
|value|mergeValues|table|columnName |
+-----+-----------+-----+-----------+
|1    |1,2,3      |     |columnName1|
|2    |4,5,6,7    |     |columnName1|
|3    |8,9        |     |columnName1|
|1    |1,2,3      |     |columnName4|
|2    |4,5,6,7    |     |columnName4|
|3    |8,9        |     |columnName4|
|1    |1,2,3      |     |columnName5|
|2    |4,5,6,7    |     |columnName5|
|3    |8,9        |     |columnName5|
|1    |1,2,3      |     |columnName6|
+-----+-----------+-----+-----------+
The structure of the second dataframe is shown below; the values listed are in columnName1:

columnName1 | columnName2 | columnName3 | columnName4 | columnName5 | columnName6
1
3
2
4
5
Now I have to create a mapped dataframe like the one below. The mapping logic is: take a value from the second dataframe and check whether it appears in the mergeValues list of the first dataframe for that column; if it does, map it to the corresponding value from the first dataframe. For example, the value 1 in columnName1 of the second dataframe appears in the mergeValues list 1,2,3, so it maps to the first dataframe's value 1; the same rule applies to 2, 3, 4, 5, 6, 7, and so on (a small sketch of this per-cell rule follows the expected output below).
columnName1 | columnName2 | columnName3 | columnName4 | columnName5 | columnName6
1
1
1
2
2
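To make the rule concrete, here is a minimal plain-Scala sketch of the per-cell logic (the function name mapCell and the comma-splitting of mergeValues are just my assumptions about the data format):

def mapCell(mergeValues: String, mappedValue: String, cell: String): String =
  // return the firstDF value when the cell is listed in mergeValues, otherwise keep the cell
  if (mergeValues.split(",").map(_.trim).contains(cell.trim)) mappedValue else cell

// e.g. mapCell("1,2,3", "1", "3") == "1" and mapCell("4,5,6,7", "2", "5") == "2"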
To do this I'm using a UDF, but it is failing. What is the correct way to create this dataframe? My code is:
val firstDF = sparkSession.read.load(first)
val testDF = sparkSession.read.load(test)

val populateColumn: ((String, String, String) => String) =
  (mergeValues: String, value: String, actualValue: String) => {
    if (mergeValues.contains(actualValue.trim)) {
      value
    } else {
      actualValue
    }
  }

val populateColumnUdf = udf(populateColumn)

val firstDFList = firstDF.collect
firstDFList.foreach(Case => {
  println(Case)
  testDF.withColumn(Case.getAs("columnName"), populateColumnUdf(Case.getAs("mergeValues"),
    Case.getAs("value"), col(Case.getAs("columnName"))))
})
testDF.show
This is the error I'm getting:
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Column
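From the error I suspect that the literal strings taken from the collected rows need to be wrapped with lit() before being passed to the UDF, and that the result of withColumn has to be reassigned because dataframes are immutable. Below is a rough sketch of what I think it might look like (using foldLeft over the collected rows is my own guess, and null cells would probably still need handling), but I'm not sure this is the right approach:

import org.apache.spark.sql.functions.{col, lit, udf}

// accumulate one withColumn per row of firstDF, starting from testDF
val mappedDF = firstDF.collect().foldLeft(testDF) { (df, row) =>
  val columnName = row.getAs[String]("columnName")
  // lit() turns the literal mergeValues/value strings into Columns for the UDF
  df.withColumn(columnName,
    populateColumnUdf(lit(row.getAs[String]("mergeValues")),
      lit(row.getAs[String]("value")),
      col(columnName)))
}
mappedDF.show()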