I am using the join function in Spark SQL. It needs a join condition, and if the columns being joined on do not have the same name, the condition has to be passed as a join expression.

Example:

x.join(y, x.column1 == y.column2)

This means that we are joining dataframes x and y on column1 in x and column2 in y.

I would like to write a function that takes the column names for both dataframes as arguments and joins on those columns. The problem is that the join expression cannot be a string. I have looked at questions like this one, where a map is used to map a variable name, but this does not fit my needs. I need to remove the quotation marks that make the column name a string and pass the result to the join function.

I have checked, and there seems to be no other way to do this in PySpark if the columns that we are joining on do not have the same name, besides generating a copy of one of the dataframes with new column names (dataframes are immutable, so column names cannot be changed in place).

Is there any other way to pass the column names into the join expression?

Michal

1 Answer

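Reposting my comment as an answer for future reference. You can get any attribute of an object by its string name using the built-in getattr function. A dataframe exposes its columns as attributes, so this turns a column name into the column itself.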

x.join(y, getattr(x, 'column1') == getattr(y, 'column2'))
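
Wrapped in the kind of function the question asks for, this could look roughly like the sketch below. The name join_on and the how parameter are made up for illustration and are not part of PySpark; the function only assumes that both arguments are dataframes exposing the named columns as attributes.

```python
def join_on(left, right, left_col, right_col, how="inner"):
    """Join two dataframes on differently named columns given as strings.

    join_on is a hypothetical helper, not a PySpark function.
    """
    # getattr turns each column name (a string) into the column object,
    # so comparing the two yields a join expression rather than a string.
    condition = getattr(left, left_col) == getattr(right, right_col)
    # Pass the expression to the dataframe's own join method.
    return left.join(right, condition, how)
```

With the dataframes from the question, join_on(x, y, 'column1', 'column2') should be equivalent to the one-liner above. One caveat: getattr returns whatever attribute has that name, so a column called, say, count would resolve to the dataframe's count method instead of the column; indexing with x['count'] avoids that clash.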
ashwinjv