
This is a similar problem to column name duplication after a join, but it cannot be resolved with the same techniques, as all of those rely on avoiding or preparing for the problem in advance.
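For context, the join variant and its usual prepare-in-advance fix look roughly like this (a minimal sketch; the names df1, df2 and the key column 'id' are illustrative, and spark is assumed to be a SparkSession):

df1 = spark.createDataFrame([(1, 'x')], ['id', 'a'])
df2 = spark.createDataFrame([(1, 'y')], ['id', 'a'])

# Aliasing both sides before the join keeps the duplicate 'a'
# columns addressable afterwards.
joined = df1.alias('l').join(df2.alias('r'), on='id')
joined.select('l.a', 'r.a').show()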

So while preparing training materials for my team, I added a warning about renaming a column to the same name as another one: Spark will happily let you do that, and then you will end up with...

AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.

... when you try df.select('a').

So obviously you should avoid the problem in the first place, or fix the code and rerun it if it occurs, but let's imagine a hypothetical situation:

You work interactively in a notebook on a set of transformations that will take a long time to calculate, and you cache the results. Only after you start working with the cached results do you realise that you made a typo and ended up with two columns with the same name. The fix is very simple, but the recalculation would take a long time, and your boss is pointing at his watch, waiting for the results...

What do you do?

Is there any way to fix the column name in place? I could df.collect() the data into Python, fix the names there, and recreate the DataFrame, but the data is huge and that kills the driver. I assume you can go down to the RDD level and fix it there (see the sketch after the repro below), but my RDD knowledge is so limited that I'm not sure it's possible that way. Any ideas?

Here is example code that would cause the problem:

df.printSchema()
root
 |-- user: integer (nullable = true)
 |-- trackId: integer (nullable = true)
 |-- artistId: integer (nullable = true)
 |-- timestamp: long (nullable = true)

df.withColumnRenamed('timestamp','user').printSchema()
root
 |-- user: integer (nullable = true)
 |-- trackId: integer (nullable = true)
 |-- artistId: integer (nullable = true)
 |-- user: long (nullable = true)


df.withColumnRenamed('timestamp','user').select('user')
AnalysisException: u"Reference 'user' is ambiguous, could be: user#134, user#248L.;"
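For what it's worth, the RDD-level route mentioned above would presumably look something like this (just a sketch, assuming a SparkSession named spark; going through df.rdd keeps the data distributed rather than collecting it to the driver, though the round trip through Python Row objects is not free):

broken = df.withColumnRenamed('timestamp', 'user')

# Rebuild the DataFrame from the underlying RDD with a corrected
# name list; createDataFrame accepts a plain list of column names
# as the schema and infers the types.
fixed = spark.createDataFrame(
    broken.rdd,
    ['user', 'trackId', 'artistId', 'timestamp'],
)
fixed.printSchema()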
Tetlanesh

1 Answer


This should work:

correct_cols = ['user','trackId','artistId','timestamp']
df = df.toDF(*correct_cols)
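toDF assigns the new names positionally, so it only rewrites the schema and should not force the cached data to be recomputed. Applied to the broken DataFrame from the question (a sketch; it assumes the column order shown in the schema above):

broken = df.withColumnRenamed('timestamp', 'user')

# toDF renames by position, leaving the cached data untouched.
fixed = broken.toDF('user', 'trackId', 'artistId', 'timestamp')
fixed.select('user')  # no longer ambiguous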
abeboparebop