How to remove strings present in a list from a column in pandas

Question

I have a dataframe df:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Kitty",
            "Hello Puppy",
            "It is an Helloexample",
            "for stackoverflow",
            "Hello World",
        ],
    }
)

which looks like:

   ID                   name
0   1            Hello Kitty
1   2            Hello Puppy
2   3  It is an Helloexample
3   4      for stackoverflow
4   5            Hello World

I have a list of strings, To_remove_lst:

To_remove_lst = ["Hello", "for", "an", "It"]

I need to remove all the strings present in the list from the column name of df. How can I do this in pandas?

My expected answer is:

   ID           name
0   1          Kitty
1   2          Puppy
2   3     is example
3   4  stackoverflow
4   5          World

Does `to_remove_lst` contain full words or can it contain substrings? — cs95, Aug 03 '18 at 06:20
You may want to state that up front or else 90% of the answers here will be useless to you. — cs95, Aug 03 '18 at 06:22
I hope you took a look at all the answers and actually tested them out on your data instead of accepting based on the number of votes (sometimes that can be misleading). — cs95, Aug 03 '18 at 06:37

jezrael · Accepted Answer · 2018-08-03T06:29:27.100

19

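I think you need str.replace if you also want to remove substrings:

# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
df['name'] = df['name'].str.replace('|'.join(To_remove_lst), '', regex=True)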

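If the strings may contain regex special characters, escape them first:

import re
# re.escape makes each list entry match literally inside the combined pattern
df['name'] = df['name'].str.replace('|'.join(map(re.escape, To_remove_lst)), '', regex=True)

print(df)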
   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World
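Note that the output above still contains the leftover spaces ("is  example", and a leading space in each row), while the expected answer in the question does not. A minimal sketch of one way to tidy that up, assuming collapsing all runs of whitespace to a single space is acceptable for your data (the shortened df and the names to_remove and cleaned are only for illustration):

```python
import re

import pandas as pd

# Shortened version of the question's data, for illustration only
df = pd.DataFrame({"name": ["Hello Kitty", "It is an Helloexample", "for stackoverflow"]})
to_remove = ["Hello", "for", "an", "It"]

pattern = "|".join(map(re.escape, to_remove))
cleaned = (
    df["name"]
    .str.replace(pattern, "", regex=True)  # remove the listed substrings
    .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
    .str.strip()                           # drop leading/trailing spaces
)
print(cleaned.tolist())  # ['Kitty', 'is example', 'stackoverflow']
```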

But if you want to remove only whole words, use a nested list comprehension:

df['name'] = [' '.join([y for y in x.split() if y not in To_remove_lst]) for x in df['name']]
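A regex alternative for the whole-word case is to wrap the alternation in \b word boundaries. This is only a sketch: it assumes every entry in the list starts and ends with a word character (letters, digits or underscore), since \b behaves differently next to punctuation. The names word_pattern and whole_words_removed are made up for the example.

```python
import re

import pandas as pd

# Shortened version of the question's data, for illustration only
df = pd.DataFrame({"name": ["Hello Kitty", "It is an Helloexample", "for stackoverflow"]})
to_remove = ["Hello", "for", "an", "It"]

# \b...\b only matches the entries when they stand alone as words
word_pattern = r"\b(?:" + "|".join(map(re.escape, to_remove)) + r")\b"
whole_words_removed = (
    df["name"]
    .str.replace(word_pattern, "", regex=True)
    .str.replace(r"\s+", " ", regex=True)  # tidy the gaps left behind
    .str.strip()
)
print(whole_words_removed.tolist())  # ['Kitty', 'is Helloexample', 'stackoverflow']
```

Unlike the substring version, "Helloexample" is left intact here because "Hello" is not a separate word in it.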

edited Aug 03 '18 at 06:29

answered Aug 03 '18 at 06:24

jezrael


What if there are regex characters in my strings I want to remove as well? – codingInMyBasement Oct 25 '19 at 22:40
You always provide a variety of answers/options in all your posts +1! – Jeru Luke May 12 '22 at 14:36

cs95 · Answer 2 · 2018-08-03T06:31:41.913

I'd recommend re.sub in a list comprehension for speed.

import re
p = re.compile('|'.join(map(re.escape, To_remove_lst)))
df['name'] = [p.sub('', text) for text in df['name']]

print(df)
   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World

List comprehensions are well optimized in CPython and, for string and regex work, are often faster than the pandas str methods, which add their own overhead. For the time being I highly recommend list comprehensions over the pandas str functions when working with string and regex data.

The use of map(re.escape, To_remove_lst) is to escape any possible regex metacharacters which are meant to be treated literally during replacement.
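As a small illustration of what escaping changes, here is a sketch using the standard re module only. The string "1.5" and the sample text are hypothetical and not part of the question's data.

```python
import re

text = "v1.5 and build 125"

# Unescaped: "." is a metacharacter and matches any character, so "125" is removed too
unescaped = re.sub("1.5", "", text)

# Escaped: re.escape turns "1.5" into r"1\.5", so only the literal "1.5" is removed
escaped = re.sub(re.escape("1.5"), "", text)

print(repr(unescaped))  # 'v and build '
print(repr(escaped))    # 'v and build 125'
```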

The pattern is precompiled before calling p.sub to avoid the overhead of compiling it on each iteration.

I've let it slide here, but please use PEP 8 compliant variable names such as "to_remove_lst" (lower snake case).

Timings

df = pd.concat([df] * 10000)
%timeit df['name'].str.replace('|'.join(To_remove_lst), '', regex=True)
%timeit [p.sub('', text) for text in df['name']]

100 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
60 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Does re.escape replace metacharacters if they are contained in my string? I don't know what you mean by "escape", even after looking at the documentation. Could you provide an example, please? — codingInMyBasement, Oct 25 '19 at 22:42
@acodejdatam If your string contains any [metacharacters](https://docs.python.org/3/howto/regex.html#matching-characters) that you want to be treated literally (instead of being parsed as metacharacters by the regex engine), then that's when you'd want `re.escape` here. There's a good explanation in the [docs](https://docs.python.org/3/library/re.html#re.escape) as well. — cs95, Oct 25 '19 at 23:18

score 0 · Answer 3 · answered Nov 15 '19 at 04:33

You can loop over the list and call str.replace for each element:

for word in To_remove_lst:
    # regex=False treats each entry as a literal string on every pandas version
    df['name'] = df['name'].str.replace(word, '', regex=False)

Output:

   ID            name
0   1           Kitty
1   2           Puppy
2   3     is  example
3   4   stackoverflow
4   5           World
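One thing to watch with the loop: when one entry is a substring of another, the order of the list changes the result. A minimal sketch of the effect, using a hypothetical list and a made-up helper remove_all (neither is from the question):

```python
import pandas as pd

names = pd.Series(["band and an"])
to_remove = ["an", "and"]  # hypothetical list: "an" is a prefix of "and"


def remove_all(series, words):
    """Remove each word in turn, treating it as a literal string."""
    for word in words:
        series = series.str.replace(word, "", regex=False)
    return series


# In list order, "an" is stripped first, so "and" never matches afterwards
in_list_order = remove_all(names, to_remove)

# Longest entries first, so "and" is removed as a whole before "an"
longest_first = remove_all(names, sorted(to_remove, key=len, reverse=True))

print(in_list_order.tolist())  # ['bd d ']
print(longest_first.tolist())  # ['b  ']
```

Sorting the list by length, longest first, is one way to make the result independent of the original order. The same consideration applies to the '|'.join pattern, where the regex engine tries the alternatives from left to right.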
