I'm trying to build some skills in Spark SQL, mainly with DataFrames. What I'm trying to do is:
- join two DataFrames
- apply a UDF that takes two columns and returns a tuple
- split the resulting tuple into two separate columns
The code I'm using is
import org.apache.spark.sql.functions.{col, udf}

val profDF = Seq((1, "James", "Detective"),
                 (2, "Harvey", "Captain"),
                 (3, "Barbara", "Club Owner")).toDF("ID", "Name", "Occ")

val persDF = Seq((1, 30, "Single"),
                 (2, 35, "Married"),
                 (3, 30, "Single")).toDF("ID", "Age", "Status")

val upperAdd2: (String, Int) => (String, Int) = (name, age) => (name.toUpperCase, age + 2)
val upAddUDF = udf(upperAdd2)

profDF.join(persDF, profDF("ID") === persDF("ID"))
  .withColumn("Result", upAddUDF(profDF("Name"), persDF("Age")))
  .withColumn("Caps Name", col("Result._1"))
  .withColumn("More Age", col("Result._2"))
  .drop("Result")
  .show
I get the following error when executing the last statement:
org.apache.spark.sql.AnalysisException: Reference 'ID' is ambiguous, could be: ID#4, ID#12.;
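The UDF logic itself doesn't seem to be the problem; the plain Scala function behind it behaves as I expect when called outside Spark (no SparkContext needed for this check):

```scala
// Plain Scala check of the function wrapped by the UDF -- no Spark involved
val upperAdd2: (String, Int) => (String, Int) =
  (name, age) => (name.toUpperCase, age + 2)

println(upperAdd2("James", 30)) // prints (JAMES,32)
```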
I think I've properly qualified "ID" with its DataFrame, because the following statement works just fine:

profDF.join(persDF, profDF("ID") === persDF("ID"))
  .withColumn("Result", upAddUDF(profDF("Name"), persDF("Age")))
  .show
Is there anything else I need to specify here?
I'm using Spark 1.6.0 and the code is only for learning purposes. Please let me know if it can be improved further.