52

I found that PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas on how to drop multiple columns at the same time?

df.drop(['col1','col2'])
TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])

/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261 

TypeError: col should be a string or a Column
zero323
MYjx

4 Answers

68

Since PySpark 2.1.0, the drop method supports multiple columns:

PySpark 2.0.2:

DataFrame.drop(col)

PySpark 2.1.0:

DataFrame.drop(*cols)

Example:

df.drop('col1', 'col2')

or using the * operator as

df.drop(*['col1', 'col2'])
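The two call styles above are interchangeable because `*` unpacks the list into separate positional arguments. A minimal pure-Python sketch (using a hypothetical `fake_drop` stand-in rather than a real Spark DataFrame, since no Spark session is needed to demonstrate the unpacking):

```python
# fake_drop is a hypothetical stand-in for DataFrame.drop(*cols):
# it simply records the positional arguments it receives.
def fake_drop(*cols):
    return cols

cols_to_drop = ['col1', 'col2']

# Unpacking the list with * passes each element as its own argument,
# so both call styles reach the function identically.
assert fake_drop(*cols_to_drop) == fake_drop('col1', 'col2')
print(fake_drop(*cols_to_drop))  # ('col1', 'col2')
```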
Sheldore
Patrick Z
  • I have a scenario where am using – Rups N Jul 19 '18 at 05:29
  • 11
    Just to be clear, in case it isn't obvious to some folks landing here, when @Patrick writes `DataFrame.drop(*cols)` above, `cols` **is a Python list**, and [putting the star before it converts it into positional arguments](https://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean). – Mike Williamson Oct 09 '18 at 02:09
57

Simply with select:

df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])

or if you really want to use drop then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)

Note:

(difference in execution time):

There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.

There is, however, a difference when we analyze the driver-side code:

  • the first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded
  • the first method generates a logical plan that is equivalent to the physical plan; in the second case it is rewritten
  • finally, comprehensions are significantly faster in Python than methods like map or reduce
  • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
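To see what the reduce call is doing without spinning up Spark, here is a minimal sketch using a hypothetical FakeDF stand-in for a DataFrame (illustration only, not Spark's API):

```python
from functools import reduce

class FakeDF:
    """Hypothetical stand-in for a Spark DataFrame: just enough
    to show how reduce chains successive drop calls."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, col):
        # Return a new "DataFrame" without the given column,
        # mirroring the immutable style of DataFrame.drop.
        return FakeDF(c for c in self.columns if c != col)

df = FakeDF(['GpuName', 'GPU1_TwoPartHwID', 'MachineName'])

# reduce(FakeDF.drop, cols, df) folds into df.drop(c1).drop(c2)...
result = reduce(FakeDF.drop, ['GpuName', 'GPU1_TwoPartHwID'], df)
print(result.columns)  # ['MachineName']
```

This also makes the driver-side cost visible: each column in the list triggers a separate drop call, whereas the select-with-comprehension approach builds the kept-column list in a single pass.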
zero323
9

The right way to do this is:

df.drop(*['col1', 'col2', 'col3'])

The * needs to come outside of the brackets if there are multiple columns to drop.

techytushar
Ceren
  • This doesn't add any new information to this post. The `*` unpacking is shown in [this answer](https://stackoverflow.com/a/42043699/5858851), with further explanation of the syntax in [this comment](https://stackoverflow.com/questions/35674490/how-to-exclude-multiple-columns-in-spark-dataframe-in-python#comment92350606_42043699) – pault May 13 '19 at 18:20
  • The answer you point to does not work for me: df.drop('col1', 'col2') is incorrect, the columns have to be in brackets and the * needs to be outside the bracket. That's why I posted. – Ceren May 14 '19 at 20:20
  • 1
    If it's not working for you, your error is somewhere else because the `df.drop(*['col1', 'col2'])` is syntactically equivalent to `df.drop('col1', 'col2')` – pault May 14 '19 at 20:24
  • @pault you're right. For some reason, your method didn't work for me earlier but now it does. In any case, the * is necessary if you do decide to use brackets, so I think it's fair to keep the answer here as a potential alternative solution. Thanks. – Ceren Jul 06 '20 at 17:53
  • @Ceren: How to make this changes happened in the dataframe ? Like it does in python inplace=True, then change is reflected in the dataframe. as noticed df.drop(*cols) returns new dataframe. – Innovator-programmer Nov 19 '21 at 09:55
0

In case none of the above works for you, try this:

from pyspark.sql.functions import col

df.drop(col("col1")).drop(col("col2"))

My spark version is 3.1.2.

pari