
Hi, I'm trying to join two DataFrames in Spark, and I'm getting the following error:

org.apache.spark.sql.AnalysisException: Reference 'Adapazari' is ambiguous, 
could be: Adapazari#100064, Adapazari#100065.;

According to several sources, this error can occur when you join two DataFrames that both have a column with the same name (1, 2, 3). However, in my case, that is not the source of the error. I can tell because (1) my columns all have different names, and (2) the reference indicated in the error is a value contained within the join column.
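For reference, when the two inputs really do share a column name, the usual fix is to qualify the column through its parent DataFrame, or to join on the shared name. A minimal sketch with hypothetical data (assumes the implicits import shown in the code below):

// Hypothetical DataFrames that both carry a column named "id"
val left  = Seq((1, "a")).toDF("id", "x")
val right = Seq((1, "b")).toDF("id", "y")

// An unqualified $"id" === $"id" would be ambiguous;
// qualify the columns through their parent DataFrames instead:
left.join(right, left("id") === right("id"), "inner")

// Or join on the shared name, which also keeps a single "id" column:
left.join(right, Seq("id"))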

My code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession
  .builder().master("local")
  .appName("Spark SQL basic example")
  .config("master", "spark://myhost:7077")
  .getOrCreate()

val sqlContext = spark.sqlContext

import sqlContext.implicits._

val people = spark.read.json("/path/to/people.jsonl")
  .select($"city", $"gender")
  .groupBy($"city")
  .pivot("gender")                  // gender codes ("0", "1", "2") become columns
  .agg(count("*").alias("total"))
  .drop("0")                        // discard the "0" gender column
  .withColumnRenamed("1", "female")
  .withColumnRenamed("2", "male")
  .na.fill(0)                       // replace missing counts with 0

val cities = spark.read.json("/path/to/cities.jsonl")
  .select($"name", $"longitude", $"latitude")

cities.join(people, $"name" === $"city", "inner")
  .count()

Everything works great until I hit the join line, and then I get the aforementioned error.

The relevant lines in build.sbt are:

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.0",
  "com.databricks" % "spark-csv_2.10" % "1.5.0",
  "org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)
Logister
  • FYI, everything works fine up to the join because Spark is lazy; the DataFrame is only actually executed when you call count(). It would probably help to see an example of the JSON for both cities and people (see the sketch after these comments). – Derek_M Feb 13 '17 at 22:18
  • @Derek_M Your question led me to do a deeper analysis of the data. It turns out that some of the JSONL was malformed. If you like, you can answer the question with "your JSON is likely bad" and I'll give you the answer points. – Logister Feb 13 '17 at 22:37
  • Ha, no worries. Glad you figured it out! – Derek_M Feb 14 '17 at 01:57
  • @Logister please add your comment as an answer; it is helpful to know it might be a data issue – Ivan Virabyan Apr 28 '17 at 12:57
  • In my case the error was caused by the key having the same name in both DataFrames. So thanks. Any insight into why this happens? – hipoglucido May 09 '17 at 13:51
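
As Derek_M's comment points out, Spark is lazy, so an error can surface far from its cause. A sketch of forcing each side to evaluate on its own before the join, so a data problem is reported against the DataFrame that owns it (these are standard DataFrame actions):

people.printSchema()   // resolves the plan without running it
people.count()         // action: executes the read/pivot pipeline by itself
cities.printSchema()
cities.count()         // action: executes the cities read by itself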

1 Answer


It turned out that this error was due to malformed JSONL. Fixing the JSONL formatting solved the problem.
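
One way to confirm this kind of data problem is Spark's corrupt-record handling when reading JSON. A minimal sketch, assuming the default _corrupt_record column name (the path is a placeholder, and the column only exists if at least one line failed to parse):

val raw = spark.read
  .option("mode", "PERMISSIVE")                           // keep bad lines instead of failing
  .option("columnNameOfCorruptRecord", "_corrupt_record") // where raw bad lines land
  .json("/path/to/people.jsonl")
  .cache() // cache so the corrupt-record column can be queried on its own

// Show the lines that failed to parse
raw.filter($"_corrupt_record".isNotNull)
  .select($"_corrupt_record")
  .show(truncate = false)

Alternatively, .option("mode", "DROPMALFORMED") silently drops unparseable lines, which can mask exactly this kind of bug.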

Logister