I have a Spark DataFrame which, after multiple transformations, needs to be joined with one of its parent DataFrames. This join fails unless I rename a column 'Year' to 'Year' (i.e. to the same name). I have seen this behavior before as well, when after 6-7 transformations the DataFrame had to be joined with the output of the 3rd transformation.
I couldn't understand why this is happening, so I tried a few things like persisting the DataFrames and using the Spark SQL API instead of PySpark, but I still got the same issue. With Spark SQL as well, the join only worked after renaming the column to the same name.
I cannot share the code because of some restrictions, but the general flow is:
DF = spark.read(.......)
subset DF
df1 = transformation1 on DF
df2 = transformation2 on df1
subset df2
df3 = transformation3 on df2

# this fails
final = df2.alias('a').join(df3.alias('b'), [condition], 'left').select('a.*')

# this succeeds
final = df2.withColumnRenamed('Year', 'Year').alias('a').join(df3.alias('b'), [condition], 'left').select('a.*')
I cannot provide the stack trace, but something like this pops up:
package.TreeNodeException: execute, tree:
Exchange hashpartitioning(.....)
(remaining logical plan)
I have only recently started with Spark and don't really understand what is happening here, so any help would be appreciated.
Also, this is my first time posting, so any pointers on how to better format the question are welcome.