I want to drop every column in a PySpark dataframe whose name contains any of the words in the banned_columns list, and form a new dataframe out of the remaining columns.
```
banned_columns = ["basket", "cricket", "ball"]
drop_these = [columns_to_drop for columns_to_drop in df.columns if columns_to_drop in banned_columns]
df_new = df.drop(*drop_these)
```
The idea of banned_columns is to drop any column whose name contains basket, cricket, or ball anywhere in it.
The above is what I did so far, but it does not work (as in, the new dataframe still contains those column names).
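Printing drop_these suggests why: the membership test columns_to_drop in banned_columns only matches names that are exactly equal to a banned word, so names that merely contain one slip through (a minimal check, using the example column names from the next paragraph):

```
banned_columns = ["basket", "cricket", "ball"]
columns = ["sports1basketjump", "sports"]

# Exact-equality test: neither name equals "basket", "cricket", or "ball",
# so nothing is selected and nothing gets dropped.
print([c for c in columns if c in banned_columns])  # []
```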
Example of the dataframe's columns:

```
sports1basketjump | sports
```
In the above example, the column sports1basketjump should be dropped because it contains the word basket.
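To make the intended behavior concrete, this is roughly the matching I am after, with a substring test instead of exact membership (a sketch; the toy row values are made up):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy dataframe with the example column names; the row values are made up.
df = spark.createDataFrame([(1, "x")], ["sports1basketjump", "sports"])

banned_columns = ["basket", "cricket", "ball"]

# Select every column whose name contains a banned word anywhere.
drop_these = [c for c in df.columns
              if any(banned in c for banned in banned_columns)]

df_new = df.drop(*drop_these)
print(df_new.columns)  # ['sports']
```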
Moreover, would using the filter and/or reduce functions be more efficient than building the list with a comprehension or a for loop?
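For reference, the filter/reduce version I have in mind looks something like this (a sketch, reusing df and banned_columns from above; I do not know whether it is actually faster, which is what I am asking):

```
from functools import reduce

# filter() expresses the same selection as the list comprehension above.
drop_these = list(filter(
    lambda c: any(banned in c for banned in banned_columns),
    df.columns,
))

# reduce() folds drop() over the matched names one at a time,
# instead of a single df.drop(*drop_these) call.
df_new = reduce(lambda acc, c: acc.drop(c), drop_these, df)
```

My suspicion is that the name matching is plain Python over a short list of strings either way, so I would not expect a measurable difference, but I would like to confirm.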