I am trying to join two dataframes.

I created aliases and referenced them according to this post: Spark Dataframe distinguish columns with duplicated name

But I am still getting an error about ambiguous columns when it hits the fillna() function, even though I do not reference RetailUnit anywhere in that call.

alloc_ns = allocation_num_spots.alias('alloc_ns')
avails_ns = avails_num_spots.alias('avails_ns')
compare_num_avails_inv = avails_ns.join(
    alloc_ns, 
        (F.col('avails_ns.BreakDateTime') == F.col('alloc_ns.AllocationDateTime')) & 
        (F.col('avails_ns.RetailUnit') == F.col('alloc_ns.RetailUnit')), 
        how='left').fillna(
    {'allocs_sum': 0}).withColumn(
    'diff', F.col('avails_sum') - F.col('allocs_sum'))

But I keep getting this error:

"Reference 'RetailUnit' is ambiguous, could be: avails_ns.RetailUnit, alloc_ns.RetailUnit.;"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1664, in fillna
    return DataFrame(self._jdf.na().fill(value), self.sql_ctx)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'RetailUnit' is ambiguous, could be: avails_ns.RetailUnit, alloc_ns.RetailUnit.;"

Here is the output of just the join:

+-------------------+----------+----------+-------------------+----------+----------+
|      BreakDateTime|RetailUnit|avails_sum| AllocationDateTime|RetailUnit|allocs_sum|
+-------------------+----------+----------+-------------------+----------+----------+
|2019-05-09 04:15:00|      CFAB|        60|               null|      null|      null|
|2019-05-09 12:31:00|      CFAB|        60|               null|      null|      null|
|2019-05-09 03:15:00|      CFAB|        60|               null|      null|      null|
|2019-05-09 22:45:00|      CFPK|        60|2019-05-09 22:45:00|      CFPK|       120|
|2019-05-09 00:45:00|      CFPK|       120|2019-05-09 00:45:00|      CFPK|        30|
|2019-05-09 02:31:00|      CFPK|        60|               null|      null|      null|
|2019-05-09 13:45:00|      CFPK|        60|2019-05-09 13:45:00|      CFPK|        30|
|2019-05-09 14:15:00|      CFPK|        60|               null|      null|      null|
|2019-05-09 23:45:00|      CFPK|        60|2019-05-09 23:45:00|      CFPK|       120|
+-------------------+----------+----------+-------------------+----------+----------+

Could someone explain why this isn't working? I do not want to create separate columns since they should join on the RetailUnit.


2 Answers

Use the .drop function to drop the duplicate column after joining the dataframes: .drop(alloc_ns.RetailUnit)

compare_num_avails_inv = avails_ns.join(
    alloc_ns, 
        (F.col('avails_ns.BreakDateTime') == F.col('alloc_ns.AllocationDateTime')) & 
        (F.col('avails_ns.RetailUnit') == F.col('alloc_ns.RetailUnit')), 
        how='left').drop(alloc_ns.RetailUnit).fillna(
    {'allocs_sum': 0}).withColumn(
    'diff', F.col('avails_sum') - F.col('allocs_sum'))

Even though we are not referring to the ambiguous column directly, fillna resolves every column by name as it traverses the schema, so it hits the two RetailUnit columns and throws the ambiguous-column exception.


You should rename the duplicate column. Note that withColumnRenamed takes column name strings, and renaming by name after the join would itself be ambiguous, so rename on the dataframe before joining:

alloc_ns_renamed = allocation_num_spots.withColumnRenamed('RetailUnit', 'RetailUnitNs').alias('alloc_ns')

compare_num_avails_inv = (
    avails_ns.join(
      alloc_ns_renamed, 
      (F.col('avails_ns.BreakDateTime') == F.col('alloc_ns.AllocationDateTime')) & (F.col('avails_ns.RetailUnit') == F.col('alloc_ns.RetailUnitNs')), 
      how='left'
    )
    .fillna({'allocs_sum': 0})
    .withColumn('diff', F.col('avails_sum') - F.col('allocs_sum'))
)

This way you don't need to drop the column, in case it is still required later.
