3

I'm trying to build some skills in Spark Sql, mainly in DataFrames. So what I'm trying to do is

  • join two DataFrames
  • apply a udf that takes a tuple of columns
  • separate the resulting tuple into two columns

The code I'm using is

import org.apache.spark.sql.functions.udf

val profDF = Seq((1,"James","Detective"),
                 (2,"Harvey","Captain"),
                 (3,"Barbara","Club Owner")).toDF("ID", "Name", "Occ")
val persDF = Seq((1, 30, "Single"),
                 (2, 35, "Married"),
                 (3, 30, "Single")).toDF("ID", "Age", "Status")

val upperAdd2:(String, Int)=>(String, Int) = (name, age) => (name.toUpperCase, age + 2)

val upAddUDF = udf(upperAdd2)

profDF.join(persDF, profDF("ID")===persDF("ID"))
    .withColumn("Result", upAdUDF(profDF("Name"),persDF("Age")))
    .withColumn("Caps Name", col("Result._1"))
    .withColumn("More Age", col("Result._2"))
    .drop("Result")
    .show

I get the following error after executing the last statement

org.apache.spark.sql.AnalysisException: Reference 'ID' is ambiguous, could be: ID#4, ID#12.;

I think I've properly specified the DataFrames for "ID" because the following statement works just fine

profDF.join(persDF, profDF("ID")===persDF("ID"))
    .withColumn("Result", upAdUDF(profDF("Name"),persDF("Age")))
    .show

Is there anything else I need to specify here?

I'm using Spark 1.6.0 and the code is only for learning purposes. Please let me know if it can be improved further.

Amber
  • 914
  • 6
  • 20
  • 51

0 Answers0