Recently I started analysing a data frame, and I want to remove all the rows whose value doesn't contain any of these substrings:

('Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing')

But when I use this syntax:

df = df[~df["GrupoAssunto"].str.contains('Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing')]

I get this error:

TypeError: contains() takes from 2 to 6 positional arguments but 10 were given
  • Can you provide a sample of your dataframe and expected output? – Shradha Nov 30 '20 at 21:16
  • Does this answer your question? [How to test if a string contains one of the substrings in a list, in pandas?](https://stackoverflow.com/questions/26577516/how-to-test-if-a-string-contains-one-of-the-substrings-in-a-list-in-pandas) – noah Nov 30 '20 at 21:25
  • Please update the question to clarify the use-case, along with a sample of the dataset and expected output. – S3DEV Nov 30 '20 at 21:29

2 Answers

0

Use the .isin() function instead.

For example:

import pandas as pd

vals1 = ['good val1', 'good val2', 'good val3', 'Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']
vals2 = ['Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']

df = pd.DataFrame({'col1': vals1})

Indexing with the negated .isin() mask returns a new DataFrame containing only the rows whose value is not in the vals2 list:

df[~df['col1'].isin(vals2)]

Output:

        col1
0  good val1
1  good val2
2  good val3
S3DEV
  • This accomplishes something else. `isin()` doesn't work for substrings in the same way as `str.contains`. The OP's issue seems to just be syntax based – noah Nov 30 '20 at 21:27
  • @noah - Ok, fair enough. OP needs to update the question to be more explicit, and provide a use-case. (Commented accordingly.) – S3DEV Nov 30 '20 at 21:28
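To make the distinction raised in the comments concrete, here is a minimal sketch of how the two methods differ. The sample values are made up for illustration and do not come from the OP's data.

```python
import pandas as pd

# Hypothetical values, chosen only to show the difference
s = pd.Series(["Telefonia Fixa", "Telefonia Fixa - reclamação", "Energia Elétrica"])

# isin() compares whole values: only an exact match counts
exact = s.isin(["Telefonia Fixa"])

# str.contains() looks for the text anywhere inside each value
partial = s.str.contains("Telefonia Fixa", regex=False)

print(exact.tolist())    # [True, False, False]
print(partial.tolist())  # [True, True, False]
```

If the column only ever holds the exact category names, both approaches should select the same rows. They diverge as soon as a value has extra text around the category name.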
0

Just separate the different strings with | and keep regex turned on. This is the proper syntax for searching for multiple strings with contains. Passing each string through re.escape takes care of escaping the parentheses and any other special characters.

import re

bad_strings = ['Aparelho Celular','Internet (Serviços e Produtos)','Serviços Telefônicos Diversos','Telefonia Celular','Telefonia Comunitária ( PABX, DDR, Etc. )','Telefonia Fixa','TV por Assinatura','Televisão / Aparelho DVD / Filmadora','Telemarketing']
# Escape regex metacharacters such as ( ) . so each string is matched literally
safe_bad_strings = [re.escape(s) for s in bad_strings]
df = df[~df["GrupoAssunto"].str.contains('|'.join(safe_bad_strings), regex=True)]

Your error occurs because the 9 strings are all passed to contains as separate positional arguments (10 in total, counting self). contains expects a single pattern, so it raises the TypeError.
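As a self-contained sketch of the approach, the snippet below builds a small DataFrame and applies the joined pattern. Only the column name GrupoAssunto comes from the question. The rows, the shortened bad_strings list and the variable names mask, without_matches and only_matches are assumptions for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data; only the column name is taken from the question
df = pd.DataFrame({"GrupoAssunto": [
    "Telefonia Fixa",
    "Internet (Serviços e Produtos)",
    "Energia Elétrica",
    "Banco Comercial",
]})

# Shortened list, for illustration only
bad_strings = ["Telefonia Fixa", "Internet (Serviços e Produtos)"]

# Build ONE pattern: escaped literals joined with the regex "or" operator
pattern = "|".join(re.escape(s) for s in bad_strings)

# True where the value contains at least one of the strings
mask = df["GrupoAssunto"].str.contains(pattern, regex=True)

without_matches = df[~mask]  # drops the rows that contain any of the strings
only_matches = df[mask]      # keeps only the rows that contain one of them

print(without_matches["GrupoAssunto"].tolist())  # ['Energia Elétrica', 'Banco Comercial']
print(only_matches["GrupoAssunto"].tolist())     # ['Telefonia Fixa', 'Internet (Serviços e Produtos)']
```

Whether to index with mask or ~mask depends on which rows should survive, and the question leaves that open. If the column can hold missing values, str.contains also accepts an na argument to decide how those rows are treated.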

noah