13

I have a DataFrame df:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty",
            "Hello Puppy",
            "It is an Helloexample",
            "for stackoverflow",
            "Hello World",
        ],
    }
)

which looks like:

   ID                   name
0   1            Hello Kitty
1   2            Hello Puppy
2   3  It is an Helloexample
3   4      for stackoverflow
4   5            Hello World

I have a list of strings, To_remove_lst:

To_remove_lst = ["Hello", "for", "an", "It"]

I need to remove every string in the list from the name column of df. How can I do this in pandas?

My expected answer is:

   ID           name
0   1          Kitty
1   2          Puppy
2   3     is example
3   4  stackoverflow
4   5          World
cs95
Archit

3 Answers

19

I think you need str.replace if you also want to remove substrings:

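# regex=True is required in pandas 2.0+, where str.replace matches literally by default
df['name'] = df['name'].str.replace('|'.join(To_remove_lst), '', regex=True)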

If the strings may contain regex metacharacters, escape them first:

import re
df['name'] = df['name'].str.replace('|'.join(map(re.escape, To_remove_lst)), '', regex=True)

print(df)
   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World
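
Note that the substring removal above leaves leading and doubled spaces behind (for example " is  example"), while the expected output in the question has none. A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace into a single space; the names pattern and cleaned are only for illustration:

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty",
            "Hello Puppy",
            "It is an Helloexample",
            "for stackoverflow",
            "Hello World",
        ],
    }
)
To_remove_lst = ["Hello", "for", "an", "It"]

# One alternation pattern; re.escape keeps each string literal.
pattern = "|".join(map(re.escape, To_remove_lst))

cleaned = (
    df["name"]
    .str.replace(pattern, "", regex=True)  # remove the substrings
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
    .str.strip()                           # drop leading/trailing spaces
)
print(cleaned.tolist())
# ['Kitty', 'Puppy', 'is example', 'stackoverflow', 'World']
```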

But if you want to remove only whole words, use a nested list comprehension:

df['name'] = [' '.join([y for y in x.split() if y not in To_remove_lst]) for x in df['name']]
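
To make the difference between the two approaches concrete, here is a small sketch on plain Python strings (the same expression works when iterating over df['name']). Converting the list to a set is my own addition for faster membership tests, not something the approach requires:

```python
to_remove = {"Hello", "for", "an", "It"}  # set used only for faster lookups
names = ["Hello Kitty", "It is an Helloexample", "for stackoverflow"]

# Split on whitespace and keep only tokens that are not in the removal set.
words_only = [
    " ".join(word for word in text.split() if word not in to_remove)
    for text in names
]
print(words_only)
# ['Kitty', 'is Helloexample', 'stackoverflow']
```

"Helloexample" survives here because it is a single token that is not equal to "Hello", whereas the str.replace version strips the "Hello" prefix from it.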
jezrael
6

I'd recommend re.sub in a list comprehension for speed.

import re
p = re.compile('|'.join(map(re.escape, To_remove_lst)))
df['name'] = [p.sub('', text) for text in df['name']] 

print (df)
   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World

List comprehensions are a fast way to loop in Python. For the time being, I highly recommend them over the pandas str functions when working with string and regex data, because the str API is a bit slow.

The use of map(re.escape, To_remove_lst) is to escape any possible regex metacharacters which are meant to be treated literally during replacement.
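
As a small illustration of what escaping changes, consider a string to remove that contains a dot. The values below are made up for the example and are not from the question's data:

```python
import re

text = "105 and 1.5"
target = "1.5"  # hypothetical string containing the metacharacter "."

# Unescaped, "." matches any character, so "105" is removed as well.
unescaped = re.sub(target, "", text)

# Escaped, the dot only matches a literal ".".
escaped = re.sub(re.escape(target), "", text)

print(repr(re.escape(target)))  # '1\\.5'
print(repr(unescaped))          # ' and '
print(repr(escaped))            # '105 and '
```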

The pattern is precompiled before calling regex.sub to reduce the overhead of compilation at each iteration.

I've let it slide here, but please use PEP 8-compliant variable names such as to_remove_lst (lower snake case).


Timings

df = pd.concat([df] * 10000)
%timeit df['name'].str.replace('|'.join(To_remove_lst), '', regex=True)
%timeit [p.sub('', text) for text in df['name']]

100 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
60 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
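
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The function names and the choice of 5 repetitions are arbitrary, and the numbers will vary with the machine and the pandas version, so treat the output as indicative only:

```python
import re
import timeit

import pandas as pd

to_remove = ["Hello", "for", "an", "It"]
names = [
    "Hello Kitty",
    "Hello Puppy",
    "It is an Helloexample",
    "for stackoverflow",
    "Hello World",
]
big = pd.Series(names * 10000)  # 50,000 rows, as in the timings above

pattern = "|".join(map(re.escape, to_remove))
p = re.compile(pattern)


def with_str_replace():
    return big.str.replace(pattern, "", regex=True)


def with_list_comp():
    return [p.sub("", text) for text in big]


# Sanity check: both approaches should produce identical strings.
same = with_str_replace().tolist() == with_list_comp()

runs = 5
t_str = timeit.timeit(with_str_replace, number=runs)
t_list = timeit.timeit(with_list_comp, number=runs)
print(f"str.replace:        {t_str / runs:.4f} s per run")
print(f"list comprehension: {t_list / runs:.4f} s per run")
```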
cs95
  • Does re.escape replace metacharacters if they are contained in my string? I don't know what you mean by "escape" and have looked at the documentation. Could you provide an example, please? – codingInMyBasement Oct 25 '19 at 22:42
  • @acodejdatam If your string contains any [metacharacters](https://docs.python.org/3/howto/regex.html#matching-characters) that you want to be treated literally (instead of being parsed as metacharacters by the regex engine), then that's when you'd want `re.escape` here. There's a good explanation in the [docs](https://docs.python.org/3/library/re.html#re.escape) as well. – cs95 Oct 25 '19 at 23:18
0

You can loop over the list and call str.replace for each element:

for word in To_remove_lst:
    # regex=False treats each string literally
    df['name'] = df['name'].str.replace(word, '', regex=False)

Output:

   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World
LOrD_ARaGOrN