I have created two DataFrames as below:
df_flights = spark1.read.parquet('domestic-flights\\flights.parquet')
df_airport_codes = spark1.read.load('domestic-flights\\flights.csv',format="csv",sep=",",inferSchema=True,header=True)
I then referenced the Databricks guide on avoiding duplicate columns in a join: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
df3 = df_flights.join(df_airport_codes, "origin_airport_code", "left")
But when I try to filter or sort by any of the columns that appear in both DataFrames, I still get an ambiguity error:
Py4JJavaError: An error occurred while calling o1553.filter.
: org.apache.spark.sql.AnalysisException: Reference 'passengers' is ambiguous, could be: passengers, passengers.;
Or, if I attempt a sort:
df3.sort('passengers')
Py4JJavaError: An error occurred while calling o1553.sort.: org.apache.spark.sql.AnalysisException: cannot resolve '`passengers`' given input columns: [flights, destination_population, origin_city, distance, passengers, seats, flights, origin_population, passengers, flight_datetime, origin_air_port_code, flight_year, seats, origin_city, destination_city, destination_city, destination_airport_code, destination_airport_code, origin_population, destination_population, flight_month, distance];;
My question: is there an error in my join logic? If not, how do I alias the ambiguous columns?