PySpark drop columns based on column names / String condition

Question

I want to drop every column in a PySpark dataframe whose name contains any of the words in the banned_columns list, and form a new dataframe from the remaining columns.

banned_columns = ["basket","cricket","ball"]
drop_these = [columns_to_drop for columns_to_drop in df.columns if columns_to_drop in banned_columns]

df_new = df.drop(*drop_these)

The idea of banned_columns is to drop any columns that start with basket and cricket, and columns that contain the word ball anywhere in their name.

The above is what I have done so far, but it does not work: the new dataframe still contains those columns.

Example of dataframe

 sports1basketjump | sports

In the above column name example, it will drop the column sports1basketjump because it contains the word basket.

Moreover, does using the filter and/or reduce functions give any performance benefit over building a list with a for loop?

@Wen Hi Wen ! I do not think that axis exists in pyspark ? or ? — PolarBear10, Jul 16 '18 at 14:53

fluxens · Accepted Answer · 2018-07-17T08:47:11.397

5

Your list comprehension does not do what you expect. The test columns_to_drop in banned_columns checks whether the whole column name equals one of the banned strings, so the result is an empty list unless a column name matches a banned string exactly. For how to match a list of substrings against a list of strings, see "matching list of substrings to a list of strings in Python".

The df.drop(*cols) will work as you expect.
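A minimal sketch of the substring check, assuming every banned word may occur anywhere in the column name (which is what the example in the question suggests). The columns list is a made-up stand-in for df.columns, so the snippet runs without a Spark session:

```python
banned_columns = ["basket", "cricket", "ball"]

# Hypothetical column names standing in for df.columns.
columns = ["sports1basketjump", "cricket_score", "football", "sports"]

# Select a column name for dropping if any banned word occurs anywhere in it.
drop_these = [c for c in columns if any(word in c for word in banned_columns)]
print(drop_these)

# With a real DataFrame: df_new = df.drop(*drop_these)
```

Only the membership test changes compared with the code in the question: word in c is a substring check on one name, whereas c in banned_columns compares the whole name against the list.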

edited Jul 17 '18 at 08:47

answered Jul 16 '18 at 16:55

fluxens


Since this answer was helpful to some, I would rather link the question. – fluxens Jul 20 '18 at 07:45
good point, feel free to tweak the question a little bit :) so the answer is more relevant – PolarBear10 Jul 20 '18 at 07:58
will do, can you please link your new q/a so I can link it? – fluxens Jul 20 '18 at 08:10
here it is https://stackoverflow.com/questions/51322445/how-to-drop-all-columns-with-null-values-in-a-pyspark-dataframe – PolarBear10 Jul 20 '18 at 08:12

score 0 · Answer 2 · answered Apr 19 '23 at 19:29

Assuming a dataframe df that needs some columns dropped, first build a regular expression that matches the banned substrings. This is done by escaping each string and joining them with |. The resulting value of pattern is "basket|cricket|ball".

import re
banned_columns = ["basket","cricket","ball"]
pattern = "|".join(re.escape(s) for s in banned_columns)

Now compile the pattern and store the compiled regular expression in a variable for use in the filter.

crexp = re.compile(pattern)

The complete list of column names, df.columns, is filtered with the built-in filter function. In Python 3 filter returns a lazy iterator, so wrap it in list to get the desired list of column names.

drop_these = list(filter(lambda s: (crexp.search(s)), df.columns))

Finally, drop the unwanted columns.

df_new = df.drop(*drop_these)
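The pattern above matches each banned word anywhere in the name. If the stricter rule from the question is wanted (basket and cricket only at the start of the name, ball anywhere), one possible variant anchors part of the pattern. This is a sketch under that assumption; the columns list and the names prefix_words, anywhere_words, and strict are illustrative and not part of the original answer:

```python
import re

# Hypothetical column names standing in for df.columns.
columns = ["basketjump", "sports1basketjump", "cricket_score", "football", "sports"]

prefix_words = ["basket", "cricket"]  # must appear at the start of the name
anywhere_words = ["ball"]             # may appear anywhere in the name

prefix_part = "|".join(re.escape(w) for w in prefix_words)
anywhere_part = "|".join(re.escape(w) for w in anywhere_words)

# ^(?:...) anchors the first group to the start; the second group is unanchored.
strict = re.compile(rf"^(?:{prefix_part})|(?:{anywhere_part})")

drop_strict = [c for c in columns if strict.search(c)]
print(drop_strict)

# With a real DataFrame: df_new = df.drop(*drop_strict)
```

A list comprehension is used here in place of list(filter(...)); both produce the same list. Since df.columns is an ordinary Python list of names, the choice between them is a matter of style and should not noticeably affect performance.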
