I have the following dataframes:

accumulated_results_df
 |-- company_id: string (nullable = true)
 |-- max_dd: string (nullable = true)
 |-- min_dd: string (nullable = true)
 |-- count: string (nullable = true)
 |-- mean: string (nullable = true)

computed_df
 |-- company_id: string (nullable = true)
 |-- min_dd: date (nullable = true)
 |-- max_dd: date (nullable = true)
 |-- mean: double (nullable = true)
 |-- count: long (nullable = false)

I am trying to do a join using Spark SQL as below:

 val resultDf = accumulated_results_df.as("a").join(computed_df.as("c"), 
                             ( $"a.company_id" === $"c.company_id" ) && ( $"c.min_dd" > $"a.max_dd" ), "left")

It gives this error:

org.apache.spark.sql.AnalysisException: Reference 'company_id' is ambiguous, could be: a.company_id, c.company_id.;

What am I doing wrong here, and how do I fix it?

Shasu
  • You should use an alias to remove the ambiguity. See https://stackoverflow.com/questions/33778664/spark-dataframe-distinguish-columns-with-duplicated-name – Anshul Apr 04 '19 at 08:49
  • 1
    Afterwards, are you trying to select company_id from resultDf? The column could be duplciated. – Shaido Apr 04 '19 at 08:49
  • @Shaido No, but I am getting the columns of both dataframes. I want only the left dataframe's columns; how do I do that? Any idea? – Shasu Apr 04 '19 at 10:05
  • @Anshul Thank you, but that is in PySpark, right? I am doing it in Scala. – Shasu Apr 04 '19 at 10:05

2 Answers

It should work if you use the col function to correctly reference the aliased dataframes and columns:

val resultDf = accumulated_results_df.as("a")
       .join(
           computed_df.as("c"),
           (col("a.company_id") === col("c.company_id")) && (col("c.min_dd") > col("a.max_dd")),
           "left"
        )
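To answer the follow-up from the comments (keeping only the left dataframe's columns), you can add a `select("a.*")` after the join. A minimal runnable sketch, using tiny stand-in dataframes with simplified schemas (the real ones from the question have more columns and different types):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Tiny stand-ins for the question's dataframes (schemas simplified to strings)
val accumulated_results_df = Seq(("c1", "2019-01-01")).toDF("company_id", "max_dd")
val computed_df = Seq(("c1", "2019-02-01")).toDF("company_id", "min_dd")

val resultDf = accumulated_results_df.as("a")
  .join(
    computed_df.as("c"),
    col("a.company_id") === col("c.company_id") && col("c.min_dd") > col("a.max_dd"),
    "left"
  )
  .select("a.*") // keep only the left dataframe's columns

resultDf.show()
```

Because the aliases survive the join, `select("a.*")` expands to exactly the left side's columns, so the result contains no duplicated `company_id`.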

Alex Ortner
  • Can you please help me with what went wrong here: https://stackoverflow.com/questions/71080797/unable-to-write-parquet-file-due-error-sparkexception-job-aborted-fileforma – Shasu Feb 11 '22 at 13:24

I have fixed it with something like below:

val resultDf = accumulated_results_df
  .join(
    computed_df.withColumnRenamed("company_id", "right_company_id").as("c"),
    accumulated_results_df("company_id") === $"c.right_company_id" && $"c.min_dd" > accumulated_results_df("max_dd"),
    "left"
  )
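The renaming works, but it is not strictly required: referencing columns through their parent dataframes (`df("col")`) is already unambiguous even when both sides share a name. A sketch of that alternative, again on tiny stand-in dataframes rather than the question's real schemas:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Tiny stand-ins for the question's dataframes (schemas simplified to strings)
val accumulated_results_df = Seq(("c1", "2019-01-01")).toDF("company_id", "max_dd")
val computed_df = Seq(("c1", "2019-02-01")).toDF("company_id", "min_dd")

// df("col") disambiguates without renaming or aliasing
val joinCond = accumulated_results_df("company_id") === computed_df("company_id") &&
  computed_df("min_dd") > accumulated_results_df("max_dd")

val resultDf = accumulated_results_df
  .join(computed_df, joinCond, "left")
  .drop(computed_df("company_id")) // drop the duplicate key from the right side

resultDf.show()
```

The `drop(Column)` overload removes only the right side's `company_id`, so the result keeps a single, unambiguous key column.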
Shasu