How can I remove a substring from a given String using Pandas

Question

Recently I started to analyse a data frame and I want to remove all the substrings that don't contain

('Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing')

But when I use this syntax:

df = df[~df["GrupoAssunto"].str.contains('Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing')]

I get this error:

TypeError: contains() takes from 2 to 6 positional arguments but 10 were given

Can you provide a sample of your dataframe and expected output? — Shradha, Nov 30 '20 at 21:16
Does this answer your question? [How to test if a string contains one of the substrings in a list, in pandas?](https://stackoverflow.com/questions/26577516/how-to-test-if-a-string-contains-one-of-the-substrings-in-a-list-in-pandas) — noah, Nov 30 '20 at 21:25
Please update the question to clarify the use-case, along with a sample of the dataset and expected output. — S3DEV, Nov 30 '20 at 21:29

score 0 · Answer 1 · answered Nov 30 '20 at 21:21

Use the .isin() function instead.

For example:

import pandas as pd

vals1 = ['good val1', 'good val2', 'good val3', 'Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']
vals2 = ['Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']

df = pd.DataFrame({'col1': vals1})

Indexing with the negated .isin() mask returns only the rows whose value is not in the vals2 list:

df[~df['col1'].isin(vals2)]

Output:

        col1
0  good val1
1  good val2
2  good val3

This accomplishes something else. `isin()` doesn't work for substrings in the same way as `str.contains`. The OP's issue seems to just be syntax based — noah, Nov 30 '20 at 21:27
@noah - Ok, fair enough. OP needs to update the question to be more explicit, and provide a use-case. (Commented accordingly.) — S3DEV, Nov 30 '20 at 21:28

score 0 · Answer 2 · answered Nov 30 '20 at 21:28

Just separate the different strings with | and turn regex on. This is the proper syntax for searching for multiple strings with contains. The re.escape conversion deals with escaping the parentheses and any other special characters.

import re

bad_strings = ['Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']
# Escape regex metacharacters so every string is matched literally.
safe_bad_strings = [re.escape(s) for s in bad_strings]
df = df[~df["GrupoAssunto"].str.contains('|'.join(safe_bad_strings), regex=True)]

Your error is occurring because the nine strings are all being passed as separate positional arguments to contains (ten arguments once the implicit self is counted). But contains expects a single pattern, so it throws an error.
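
To make the difference between substring matching and the exact matching of .isin() concrete, here is a minimal sketch on toy data. The column name GrupoAssunto is taken from the question; the rows, the shortened bad_strings list, and the variable names has_bad, is_bad, dropped and kept are made up for illustration.

```python
import re

import pandas as pd

# Toy data: the column name comes from the question, the rows are invented.
df = pd.DataFrame({
    "GrupoAssunto": [
        "Telefonia Fixa",
        "Internet (Serviços e Produtos)",
        "Reclamação: Telefonia Celular pré-paga",
        "Energia Elétrica",
        "Banco comercial",
    ]
})

# Shortened, illustrative version of the list in the question.
bad_strings = ["Telefonia Fixa", "Internet (Serviços e Produtos)", "Telefonia Celular"]

# Escape regex metacharacters such as "(" and ")" so each entry matches literally,
# then join the alternatives with "|".
pattern = "|".join(re.escape(s) for s in bad_strings)

# Substring match: True where any of the strings occurs anywhere in the cell.
has_bad = df["GrupoAssunto"].str.contains(pattern, regex=True, na=False)

# Exact match: True only where the whole cell equals one of the strings.
is_bad = df["GrupoAssunto"].isin(bad_strings)

dropped = df[~has_bad]  # rows that contain none of the strings
kept = df[has_bad]      # rows that contain at least one of them

print(dropped["GrupoAssunto"].tolist())
print(is_bad.tolist())
```

The third row shows the difference: str.contains flags it because 'Telefonia Celular' appears inside the longer text, while .isin() does not because the whole cell is not equal to any list entry. Without re.escape, the parentheses in 'Internet (Serviços e Produtos)' would be read as a regex group rather than literal characters, so that entry would not match its own text. Since the question does not say clearly whether matching rows should be dropped or kept, both dropped and kept are shown.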
