
I have a large number of columns in a PySpark DataFrame, say 200. I want to select all the columns except, say, 3-4 of them. How do I select these columns without having to manually type the names of all the columns I want to select?

Tshilidzi Mudau
  • use `drop` with columns you'd like to exclude. – Vamsi Prabhala Jun 13 '18 at 13:14
  • `df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])` – vvg Jun 13 '18 at 14:18
  • Possible duplicate of [How to exclude multiple columns in Spark dataframe in Python](https://stackoverflow.com/questions/35674490/how-to-exclude-multiple-columns-in-spark-dataframe-in-python) – vvg Jun 13 '18 at 14:18

3 Answers


In the end, I settled for the following (see the runnable sketch after the list):

  • Drop:

    df.drop('column_1', 'column_2', 'column_3')

  • Select:

    df.select([c for c in df.columns if c not in {'column_1', 'column_2', 'column_3'}])
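For reference, a minimal runnable sketch of both approaches; the toy column names below are made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the real 200-column DataFrame.
    df = spark.createDataFrame(
        [(1, 2, 3, 4, 5)],
        ['column_1', 'column_2', 'column_3', 'keep_a', 'keep_b'],
    )

    # drop() returns a new DataFrame without the named columns.
    df.drop('column_1', 'column_2', 'column_3').show()

    # The select() comprehension keeps the original column order.
    df.select([c for c in df.columns if c not in {'column_1', 'column_2', 'column_3'}]).show()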

Tshilidzi Mudau

This might be helpful:

df_cols = list(set(df.columns) - {'<col1>', '<col2>'})  # add further columns to exclude as needed

df.select(df_cols).show()
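Note that `set()` does not preserve the original column order, so the selected columns may come back in arbitrary order. A minimal order-preserving variant, using the same placeholder names:

    exclude = {'<col1>', '<col2>'}  # placeholders; substitute your own column names

    # Iterating over df.columns keeps the DataFrame's original column order.
    df.select([c for c in df.columns if c not in exclude]).show()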
sairamdgr8
df.drop(*cols_to_drop)  # cols_to_drop is a Python list of column names

Useful if the list of columns to drop is huge, or if it can be derived programmatically (see the sketch below).
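For instance, a sketch of deriving the list programmatically; the `tmp_` prefix here is hypothetical:

    # Build the drop list from a naming convention instead of typing names out.
    cols_to_drop = [c for c in df.columns if c.startswith('tmp_')]

    df = df.drop(*cols_to_drop)  # unpack the list into drop()'s varargs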

martand