
I am working with PySpark 2.0.

My code is:

for col in to_exclude: 
    df = df.drop(col)

I cannot do `df = df.drop(*to_exclude)` directly because in 2.0 the `drop` method accepts only one column at a time.

Is there a way to change my code and remove the for loop?

Alper t. Turker
Steven

1 Answer


First of all, don't worry. Even if you drop columns in a loop, that does not mean Spark executes a separate query for each drop. Transformations are lazy, so Spark builds one big execution plan first and then executes everything at once (but you probably know that already).

However, if you still want to get rid of the loop within the 2.0 API, I'd go with the opposite of what you've implemented: instead of dropping the unwanted columns, select only the ones you need:

df.select([col for col in df.columns if col not in to_exclude])
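To see exactly what list ends up being passed to `select`, here is the same filtering logic in plain Python (no Spark session needed; the column names and `to_exclude` contents are made up for illustration — in real code they would come from `df.columns` and your own exclusion list):

```python
# Hypothetical column names; in Spark these would come from df.columns.
columns = ["id", "name", "age", "tmp_flag", "debug_info"]
to_exclude = ["tmp_flag", "debug_info"]

# Same list comprehension the answer passes to df.select(...):
keep = [c for c in columns if c not in to_exclude]
print(keep)  # ['id', 'name', 'age']
```

If `to_exclude` is large, converting it to a `set` first makes the membership test O(1) per column instead of O(len(to_exclude)).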
vvg