1

I have a dataframe:

Product Storage Price
Azure (2.4% Server £540
AWS Server £640
GCP Server £540

I would like to remove the column which contains the string '(2.4%' however I only want to remove the column in Pandas through regex if regex finds either a bracket or percentage in the string in that column '(%' and then pandas should drop that column entirely.

Please can you help me find a way to use regex to search for special characters within a string and drop the column if that condition is met?

I've searched on stack/google. I've used the following so far:

df = df.drop([col for col in df.columns if df[col].eq('(%').any()], axis=1)

chars = '(%'
regex = f'[{"".join(map(re.escape, chars))}]'

df = df.loc[:, ~df.apply(lambda c: c.str.contains(regex).any())]

however neither of these worked.

Any help would be greatly appreciated. :)

Thank You * Insert Smiley*

Yuca
  • 6,010
  • 3
  • 22
  • 42
  • Please provide a Minimal, Reproducible Example https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – sayan dasgupta Sep 14 '22 at 15:10

3 Answers3

1

I would do something like this

import pandas as pd
from io import StringIO

text = """
Product,Perc,Storage,Price
Azure,(2.4%,Server,£540
AWS,,Server,£640
GCP,,Server,£540
"""
data = pd.read_csv(StringIO(text))
print(data)

drop_columns = list()
for col_name in data.columns:
    has_special_characters = data[col_name].str.contains("[\(%]")
    if has_special_characters.any():
        drop_columns.append(col_name)

print(f"Dropping {drop_columns}")
data.drop(drop_columns, axis=1, inplace=True)
print(data)

Output of the script is:

  Product   Perc Storage Price
0   Azure  (2.4%  Server  £540
1     AWS    NaN  Server  £640
2     GCP    NaN  Server  £540
Dropping ['Perc']
  Product Storage Price
0   Azure  Server  £540
1     AWS  Server  £640
2     GCP  Server  £540

Process finished with exit code 0
Eelco van Vliet
  • 1,198
  • 12
  • 18
  • Thank you for your response and for taking the time to help me find a solution! :) – user19795989 Sep 14 '22 at 16:04
  • No problem. Btw: the answer by to_data is not correct. He should have used a regular expression, his expression evaluates to False. I tried it and the column with the (2.4% is not dropped. For the rest it is identical as my answer, except he uses a one-liner, which generally leads to code that is harder to read. – Eelco van Vliet Sep 14 '22 at 19:56
  • And I dont want to be a spoiler, but the answer by Scaro0974 is also not correct in case you have None values in your data frame (as is the case if you read the data from a csv file using read_csv). The None values will raise a TypeError in the regular expression. Moreover, iterating over the elements of a column is generally much slower than the inbuilt string evaluation on the whole column, as I did. So I come to the conclusion that I have given the only correct answer :-) – Eelco van Vliet Sep 14 '22 at 20:03
  • Hi Eelco, the only problem with this solution is I have a predefined list with many elements some 2,000+ and when I create a dataFrame the column holding the 2.4% doesn't have a column header. So I can't predefine a name for that column, unless I create a positional argument first in Pandas to change the name to Perc. – user19795989 Sep 15 '22 at 11:59
  • I have given the name 'Perc' in the header of the 'csv file', I don't use it in the script, because there it gets assigned to the variable 'col_name'. In case I would have used an empty field in the csv file, the read_csv file would have automatically assigned a name to it as 'Unnamed: 1'. If you have an empty string as column name, you can still use this script. You only have to be careful that you don'nt have mulitple columns with the same name. It is better to assign names to all your columns – Eelco van Vliet Sep 16 '22 at 07:54
1

you re using eq function it check exactly if the value in the columun match % instead of eq do this

df.drop([col for col in df.columns if df[col].apply(lambda x:'(%' in str(x)).any()], axis=1,inplace=True)
Mouad Slimane
  • 913
  • 3
  • 12
1

You can try this (I guess the name of the column you want to drop is ""):

import re

change_col = False
for elem in df[""]:
    if re.search(r'[(%]', elem):
        change_col = True

if change_col:
    df = df.drop("", axis=1)
Scaro974
  • 199
  • 5
  • Thank you for your response and for taking the time to help me find a solution, this worked and was the second neatest solution! :) – user19795989 Sep 14 '22 at 16:05
  • The empty elements in your column will raise a TypeError, you cannot assume the empty values are empty strings (normally these are represented by nan values) – Eelco van Vliet Sep 14 '22 at 20:08