
I want to drop the columns in a PySpark dataframe whose names contain any of the words in the banned_columns list, and form a new dataframe from the remaining columns.

banned_columns = ["basket","cricket","ball"]
drop_these = [columns_to_drop for columns_to_drop in df.columns if columns_to_drop in banned_columns]

df_new = df.drop(*drop_these)

The idea behind banned_columns is to drop any columns whose names start with basket or cricket, and any columns that contain the word ball anywhere in their name.

The above is what I have done so far, but it does not work (the new dataframe still contains those column names).

Example of dataframe

 sports1basketjump | sports

In the above column name example, it will drop the column sports1basketjump because it contains the word basket.

Moreover, does using the filter and/or reduce functions offer any optimization over building a list with for loops?

PolarBear10

2 Answers


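Your list comprehension does not do what you expect. The expression columns_to_drop in banned_columns tests whether the whole column name is an element of the list, not whether a banned word occurs inside the name, so it returns an empty list unless a column name exactly matches one of the banned strings. For an answer on how to match a list of substrings against a list of strings, check out "matching list of substrings to a list of strings in Python".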

The df.drop(*cols) call itself will work as you expect.
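
As a minimal sketch of the substring check described above (the column names here are hypothetical stand-ins for df.columns, not taken from a real dataframe), any() can test each column name against every banned word:

```python
banned_columns = ["basket", "cricket", "ball"]

# Hypothetical column names standing in for df.columns
columns = ["sports1basketjump", "sports", "cricket_score", "football"]

# Select a column name if any banned word occurs anywhere inside it
drop_these = [c for c in columns if any(word in c for word in banned_columns)]

# With a real PySpark DataFrame this list would be passed on as:
# df_new = df.drop(*drop_these)
```

This matches the banned words anywhere in the name; it does not distinguish "starts with" from "contains".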

fluxens

Assuming a dataframe df that needs to have some columns dropped, first build a regular expression that matches the banned substrings. This can be done by joining the (escaped) string values with |. The resulting value for pattern is "basket|cricket|ball".

import re
banned_columns = ["basket","cricket","ball"]
pattern = "|".join(re.escape(s) for s in banned_columns)

Now build the regular expression and store it in a variable for use in the filter.

crexp = re.compile(pattern)

The complete list of column names, df.columns, is filtered with the built-in filter function. In Python 3 filter returns a lazy iterator, so wrap it in list to materialize the desired list of column names.

drop_these = list(filter(lambda s: (crexp.search(s)), df.columns))
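
On the question of filter versus a list comprehension: df.columns is an ordinary Python list of strings, so either form runs as plain Python over a handful of names and the choice is mostly one of style. A small sketch, using hypothetical column names in place of df.columns, showing that the two forms select the same names:

```python
import re

banned_columns = ["basket", "cricket", "ball"]
crexp = re.compile("|".join(re.escape(s) for s in banned_columns))

# Hypothetical column names standing in for df.columns
columns = ["sports1basketjump", "sports", "cricket_score", "football"]

# crexp.search returns a match object (truthy) or None (falsy)
via_filter = list(filter(crexp.search, columns))
via_comprehension = [c for c in columns if crexp.search(c)]
```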

Finally, drop the unwanted columns.

df_new = df.drop(*drop_these)
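
Note that the pattern above matches the banned words anywhere in a column name. If the stricter reading in the question is intended (basket and cricket only as prefixes, ball anywhere), one possible variation is to anchor the first two words with ^. This is a sketch under that assumption, with hypothetical column names:

```python
import re

# Assumption: "basket" and "cricket" must start the name, "ball" may appear anywhere
crexp_anchored = re.compile(r"^(?:basket|cricket)|ball")

# Hypothetical column names standing in for df.columns
columns = ["basket_score", "sports1basketjump", "cricket_runs", "football", "sports"]

drop_these = [c for c in columns if crexp_anchored.search(c)]

# With a real PySpark DataFrame: df_new = df.drop(*drop_these)
```

Under this reading sports1basketjump is kept, because basket is not at the start of the name, so use the unanchored pattern if that column should be dropped as in the question's example.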
Joshcodes