52

I found that PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas on how to drop multiple columns at the same time?

df.drop(['col1','col2'])
TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])

/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261 

TypeError: col should be a string or a Column
zero323
MYjx

4 Answers

68

Since PySpark 2.1.0, the drop method supports multiple columns:

PySpark 2.0.2:

DataFrame.drop(col)

PySpark 2.1.0:

DataFrame.drop(*cols)

Example:

df.drop('col1', 'col2')

or using the * operator as

df.drop(*['col1', 'col2'])
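The two call styles above are interchangeable because `*` unpacks the list into separate positional arguments. A minimal pure-Python sketch (using a hypothetical `fake_drop` stand-in rather than a real Spark DataFrame, since no Spark session is needed to demonstrate the unpacking):

```python
# fake_drop is a hypothetical stand-in for DataFrame.drop(*cols):
# it simply records the positional arguments it receives.
def fake_drop(*cols):
    return cols

cols_to_drop = ['col1', 'col2']

# Unpacking the list with * passes each element as its own argument,
# so both call styles reach the function identically.
assert fake_drop(*cols_to_drop) == fake_drop('col1', 'col2')
print(fake_drop(*cols_to_drop))  # ('col1', 'col2')
```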
Sheldore
Patrick Z
  • I have a scenario where am using – Rups N Jul 19 '18 at 05:29
  • 11
    Just to be clear, in case it isn't obvious to some folks landing here, when @Patrick writes `DataFrame.drop(*cols)` above, `cols` **is a Python list**, and [putting the star before it converts it into positional arguments](https://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean). – Mike Williamson Oct 09 '18 at 02:09
57

Simply with select:

df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])

or if you really want to use drop then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)

Note:

(difference in execution time):

There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.

There is, however, a difference when we analyze the driver-side code:

  • the first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded
  • the first method generates a logical plan that is equivalent to the physical plan; in the second case it is rewritten
  • finally, comprehensions are significantly faster in Python than methods like map or reduce
  • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
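To see what the reduce call is doing without spinning up Spark, here is a minimal sketch using a hypothetical FakeDF stand-in for a DataFrame (illustration only, not Spark's API):

```python
from functools import reduce

class FakeDF:
    """Hypothetical stand-in for a Spark DataFrame: just enough
    to show how reduce chains successive drop calls."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, col):
        # Return a new "DataFrame" without the given column,
        # mirroring the immutable style of DataFrame.drop.
        return FakeDF(c for c in self.columns if c != col)

df = FakeDF(['GpuName', 'GPU1_TwoPartHwID', 'MachineName'])

# reduce(FakeDF.drop, cols, df) folds into df.drop(c1).drop(c2)...
result = reduce(FakeDF.drop, ['GpuName', 'GPU1_TwoPartHwID'], df)
print(result.columns)  # ['MachineName']
```

This also makes the driver-side cost visible: each column in the list triggers a separate drop call, whereas the select-with-comprehension approach builds the kept-column list in a single pass.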
zero323
9

The right way to do this is:

df.drop(*['col1', 'col2', 'col3'])

The * needs to come outside of the brackets if there are multiple columns to drop.

techytushar
Ceren
  • This doesn't add any new information to this post. The `*` unpacking is shown in [this answer](https://stackoverflow.com/a/42043699/5858851), with further explanation of the syntax in [this comment](https://stackoverflow.com/questions/35674490/how-to-exclude-multiple-columns-in-spark-dataframe-in-python#comment92350606_42043699) – pault May 13 '19 at 18:20
  • The answer you point to does not work for me: df.drop('col1', 'col2') is incorrect, the columns have to be in brackets and the * needs to be outside the bracket. That's why I posted. – Ceren May 14 '19 at 20:20
  • 1
    If it's not working for you, your error is somewhere else because the `df.drop(*['col1', 'col2'])` is syntactically equivalent to `df.drop('col1', 'col2')` – pault May 14 '19 at 20:24
  • @pault you're right. For some reason, your method didn't work for me earlier but now it does. In any case, the * is necessary if you do decide to use brackets, so I think it's fair to keep the answer here as a potential alternative solution. Thanks. – Ceren Jul 06 '20 at 17:53
  • @Ceren: How to make this changes happened in the dataframe ? Like it does in python inplace=True, then change is reflected in the dataframe. as noticed df.drop(*cols) returns new dataframe. – Innovator-programmer Nov 19 '21 at 09:55
0

In case none of the above works for you, try this:

from pyspark.sql.functions import col

df.drop(col("col1")).drop(col("col2"))

My spark version is 3.1.2.

pari