I am using the join function in Spark SQL (PySpark). It needs a join condition, and if the columns being joined on do not have the same name, the condition has to be passed as a join expression.
Example:
x.join(y, x.column1 == y.column2)
This means that we are joining dataframes x and y on column1 in x and column2 in y.
I would like to write a function that takes the column names for both dataframes as arguments and joins on those columns. The problem is that the join expression cannot be a string. I have looked at questions like this one, where a map is used to map a variable name, but that does not fit my needs. I need to remove the quotation marks that make the column name a string and pass the names to the join function.
I have checked and found no other way to do this in PySpark when the columns being joined on do not have the same name, besides generating a copy of one of the dataframes with new column names (dataframes are immutable, so column names cannot be changed in place).
Is there any other way to pass the column names into the join expression?