
I have a Spark data frame which, after multiple transformations, needs to be joined with one of its parent data frames. This join fails unless I rename a column 'year' to 'year' (i.e., rename it to the same name). I have seen this behaviour before as well, when after 6-7 transformations the data frame had to be joined with the output of the 3rd transformation.

I couldn't understand why this is happening, so I tried random things like persisting, and tried using the Spark SQL API instead of PySpark, but still got the same issue. In the Spark SQL case as well, the join only worked after renaming the column to the same name.

I cannot share the code because of some restrictions, but the general code flow is like this:

  DF = spark.read(.......)

  # subset DF

  df1 = transformation1 on DF
  df2 = transformation2 on df1

  # subset df2
  df3 = transformation3 on df2

  # this fails
  final = df2.alias('a').join(df3.alias('b'), [condition], 'left').select('a.*')

  # this succeeds
  final = df2.withColumnRenamed('Year', 'Year').alias('a').join(df3.alias('b'), [condition], 'left').select('a.*')

I cannot provide the full stack trace, but something like this appears:

     package.TreeNodeException: execute, tree:

          Exchange hashpartitioning(.....)

                  remaining logical plan

I have only recently started with Spark and don't really understand what is happening here, so any help would be appreciated.

Also, this is my first time posting, so any pointers on how to better format the question are welcome.

asked by Amitoz, edited by thebluephantom

1 Answer

Bugs. I simply rename the column; it is painful, but it works.

See How to resolve the AnalysisException: resolved attribute(s) in Spark. It covers other scenarios as well.

Also see How to rename duplicated columns after join?. There are many posts on SO in this regard.

This still occurs with the latest release, Spark 2.4, as well.

answered by thebluephantom