
I have a large number of columns in a PySpark DataFrame, say 200. I want to select all the columns except, say, 3-4 of them. How do I select these columns without having to manually type the names of all the columns I want to select?

Tshilidzi Mudau
  • use `drop` with columns you'd like to exclude. – Vamsi Prabhala Jun 13 '18 at 13:14
  • `df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])` – vvg Jun 13 '18 at 14:18
  • Possible duplicate of [How to exclude multiple columns in Spark dataframe in Python](https://stackoverflow.com/questions/35674490/how-to-exclude-multiple-columns-in-spark-dataframe-in-python) – vvg Jun 13 '18 at 14:18

3 Answers


In the end, I settled for the following (see the runnable sketch after the list):

  • Drop:

    df.drop('column_1', 'column_2', 'column_3')

  • Select:

    df.select([c for c in df.columns if c not in {'column_1', 'column_2', 'column_3'}])
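For reference, a minimal runnable sketch of both approaches; the toy column names below are made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the real 200-column DataFrame.
    df = spark.createDataFrame(
        [(1, 2, 3, 4, 5)],
        ['column_1', 'column_2', 'column_3', 'keep_a', 'keep_b'],
    )

    # drop() returns a new DataFrame without the named columns.
    df.drop('column_1', 'column_2', 'column_3').show()

    # The select() comprehension keeps the original column order.
    df.select([c for c in df.columns if c not in {'column_1', 'column_2', 'column_3'}]).show()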

Tshilidzi Mudau

This might be helpful:

df_cols = list(set(df.columns) - {'<col1>', '<col2>'})  # add further columns to exclude as needed

df.select(df_cols).show()
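Note that `set()` does not preserve the original column order, so the selected columns may come back in arbitrary order. A minimal order-preserving variant, using the same placeholder names:

    exclude = {'<col1>', '<col2>'}  # placeholders; substitute your own column names

    # Iterating over df.columns keeps the DataFrame's original column order.
    df.select([c for c in df.columns if c not in exclude]).show()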
sairamdgr8
df.drop(*cols_to_drop)  # cols_to_drop is a Python list of column names

Useful if the list of columns to drop is huge, or if it can be derived programmatically (see the sketch below).
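For instance, a sketch of deriving the list programmatically; the `tmp_` prefix here is hypothetical:

    # Build the drop list from a naming convention instead of typing names out.
    cols_to_drop = [c for c in df.columns if c.startswith('tmp_')]

    df = df.drop(*cols_to_drop)  # unpack the list into drop()'s varargs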

martand