
I am working with PySpark 2.0.

My code is:

for col in to_exclude: 
    df = df.drop(col)

I cannot do `df = df.drop(*to_exclude)` directly because in 2.0 the `drop` method accepts only one column at a time.

Is there a way to change my code and remove the for loop?

Alper t. Turker
Steven

1 Answer


First of all, don't worry. Even if you drop columns in a loop, that does not mean Spark executes a separate query for each drop. Transformations are lazy, so Spark builds one big execution plan first and then executes everything at once (but you probably know that already).

However, if you still want to get rid of the loop within the 2.0 API, I'd go with the opposite of what you've implemented: instead of dropping the unwanted columns, select only the ones you need:

df.select([col for col in df.columns if col not in to_exclude])
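To see exactly what list ends up being passed to `select`, here is the same filtering logic in plain Python (no Spark session needed; the column names and `to_exclude` contents are made up for illustration — in real code they would come from `df.columns` and your own exclusion list):

```python
# Hypothetical column names; in Spark these would come from df.columns.
columns = ["id", "name", "age", "tmp_flag", "debug_info"]
to_exclude = ["tmp_flag", "debug_info"]

# Same list comprehension the answer passes to df.select(...):
keep = [c for c in columns if c not in to_exclude]
print(keep)  # ['id', 'name', 'age']
```

If `to_exclude` is large, converting it to a `set` first makes the membership test O(1) per column instead of O(len(to_exclude)).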
vvg