Hi, I'm trying to join two DataFrames in Spark, and I'm getting the following error:
org.apache.spark.sql.AnalysisException: Reference 'Adapazari' is ambiguous,
could be: Adapazari#100064, Adapazari#100065.;
According to several sources, this error can occur when you join two DataFrames that both have a column with the same name (1, 2, 3). However, in my case that is not the source of the error: (1) my columns all have different names, and (2) the reference indicated in the error is a value contained within the join column, not a column name.
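For reference, here's a minimal sketch of the usual version of this error, where both sides of the join share a column name (the DataFrames and values below are made up for illustration):

import spark.implicits._
// Both sides have a column literally named "city", so $"city" is ambiguous:
val left = Seq(("Adapazari", 10L)).toDF("city", "female")
val right = Seq(("Adapazari", 40.77)).toDF("city", "latitude")
// left.join(right, $"city" === $"city")  // throws: Reference 'city' is ambiguous
// The usual fix is to qualify the columns through their parent DataFrames:
left.join(right, left("city") === right("city"))

That pattern can't be what's happening here, since my join columns have different names.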
My code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession
  .builder().master("local")
  .appName("Spark SQL basic example")
  .config("master", "spark://myhost:7077")
  .getOrCreate()
import spark.implicits._
// Count people per city, pivoting gender codes into columns
val people = spark.read.json("/path/to/people.jsonl")
  .select($"city", $"gender")
  .groupBy($"city")
  .pivot("gender")
  .agg(count("*").alias("total"))
  .drop("0")                         // drop the column for gender code 0
  .withColumnRenamed("1", "female")  // gender code 1 -> female
  .withColumnRenamed("2", "male")    // gender code 2 -> male
  .na.fill(0)                        // cities with no rows for a code get 0
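If I understand pivot correctly, people should come out with roughly this schema (assuming the only gender codes in the data are 0, 1, and 2):

people.printSchema()
// root
//  |-- city: string
//  |-- female: long
//  |-- male: long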
val cities = spark.read.json("/path/to/cities.jsonl")
  .select($"name", $"longitude", $"latitude")
// This is the line that throws the AnalysisException:
cities.join(people, $"name" === $"city", "inner")
  .count()
Everything works great until I hit the join line, and then I get the aforementioned error.
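For completeness, I'd expect the explicitly qualified form of the join to resolve identically, since the column names already differ (sketch only, not what I'm running):

cities.join(people, cities("name") === people("city"), "inner")
  .count()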
The relevant lines in build.sbt are:
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.0",
  "com.databricks" % "spark-csv_2.10" % "1.5.0",
  "org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)