Regex: removing multiple words from a string using "or" regex removes some words but not all

Question

I have this code:

import re
import pandas as pd

pattern = " LLC | CO | CORP | DIV "
col = pd.Series(' TEST LLC TEST CO TEST CORP TEST DIV ')
col = col.map(lambda x:re.sub(pattern, '  ', x))
col

which correctly gives: TEST TEST TEST TEST

However, when I replace the 2nd row with:

col = pd.Series(' COMPUTING DEVICES CO DIV CLDC ')

I get incorrect output: COMPUTING DEVICES DIV CLDC

So in the first example, DIV was correctly removed. However in the second example DIV was not removed even though the code is identical?

Any idea why, and how to fix it? Many thanks in advance!

sshashank124 · Accepted Answer · 2020-01-05T06:09:41.160

4

This happens because your pattern matches each keyword together with the spaces surrounding it. In your second example, the standalone " CO " is matched first, and that match consumes the space after the "CO". The remaining string to be matched is then "DIV CLDC ". Your pattern only accepts a keyword when it has whitespace on both sides, but DIV in this case no longer has a space before it.

 COMPUTING DEVICES CO DIV CLDC 
 |___________________|  # already consumed
                      |______|  # remaining string
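
The consumption described above can be checked directly. This is a minimal sketch using only the standard `re` module and the strings from the question; `lookahead_pattern` is an illustrative name, and the lookahead variant is one possible workaround rather than the approach recommended in this answer.

```python
import re

pattern = " LLC | CO | CORP | DIV "
text = " COMPUTING DEVICES CO DIV CLDC "

# Matches are found left to right and never overlap, so the space
# consumed by " CO " is not available as the leading space of " DIV ".
matches = [(m.group(), m.span()) for m in re.finditer(pattern, text)]
print(matches)  # [(' CO ', (18, 22))]

# Illustrative alternative: a lookahead checks for the trailing space
# without consuming it, so adjacent keywords can both match.
lookahead_pattern = r" (?:LLC|CO|CORP|DIV)(?= )"
result = re.sub(lookahead_pattern, "", text)
print(repr(result))  # ' COMPUTING DEVICES CLDC '
```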

A better way to accomplish what you want is to match on word boundaries using the \b specifier (this is so that you don't pick up the CO in COMPUTING).

Here is the modified regex with an online example:

\b(LLC|CO|CORP|DIV)\b

This will match your keywords only when they are standalone words, and the match will not include the surrounding whitespace. You can then replace these matches with an empty string '' instead of ' '.
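
A minimal sketch of the word-boundary approach, using plain `re` so it runs without pandas. `strip_keywords` is a hypothetical helper name; the same function could be passed to `Series.map`. Note the raw string (`r"..."`): without it, Python reads `\b` as a backspace character rather than a word boundary.

```python
import re

# Raw string so that \b reaches the regex engine as a word boundary.
pattern = re.compile(r"\b(?:LLC|CO|CORP|DIV)\b")

def strip_keywords(text):
    """Remove the standalone keywords, leaving the whitespace untouched."""
    return pattern.sub("", text)

cleaned = strip_keywords(" COMPUTING DEVICES CO DIV CLDC ")
print(repr(cleaned))  # ' COMPUTING DEVICES   CLDC '
```

The CO inside COMPUTING is left alone, and both CO and DIV are removed. The spaces that surrounded the removed words stay behind, so a run of three spaces remains between DEVICES and CLDC.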

edited Jan 05 '20 at 06:09

answered Jan 05 '20 at 06:03

sshashank124


Got it! great explanation, thanks. However I tried your method and it's not removing anything. Here's the code ```pattern = '\b(LLC|CO|CORP|DIV)\b' col = pd.Series(' COMPUTING DEVICES CO DIV LLC CLDC ') col = col.map(lambda x:re.sub(pattern, ' ', x)) col``` The result I'm getting is: 0 COMPUTING DEVICES CO DIV LLC CLDC – Chadee Fouad Jan 05 '20 at 06:18
You need to escape the backslash or use a raw string: `pattern = r'\b(LLC|CO|CORP|DIV)\b'`. Note the `r''` for specifying a raw string – sshashank124 Jan 05 '20 at 06:20
Yup that worked! However, the minor problem is that now I'm getting the unwanted double spaces that were left when the 'CO' for example was removed...can the regex take those out? or do I need a loop to replace double spaces? – Chadee Fouad Jan 05 '20 at 06:25
I cannot see the spaces in your comment. Please surround your string with `\`\`` backticks – sshashank124 Jan 05 '20 at 06:27
``0 COMPUTING DEVICES CLDC `` I did but still stack is removing the extra spaces...there are like 6 spaces between DEVICES and CLDC because 3 words were removed and for each there was space before and after – Chadee Fouad Jan 05 '20 at 06:28
Looks fine? The space at the end was also there in the original string. You can use `strip()` to trim whitespace from the ends – sshashank124 Jan 05 '20 at 06:30
The 6 spaces are in between DEVICE and CLDC so strip() can't fix it. I can fix it with a loop but that's not very elegant...thanks a lot for your help anyway :-) – Chadee Fouad Jan 05 '20 at 06:31
You can modify the regex to optionally include 0 or more spaces before or after the pattern by using `\s*`. I'm sure there's some variation that will fit your need – sshashank124 Jan 05 '20 at 06:32
```col = col.map(lambda x: re.sub(' +',' ', x)) ``` removes any 2 or more spaces – Chadee Fouad Jan 05 '20 at 07:28
Your solution used ```pattern = r'\b(LLC|CO|CORP|DIV)\b'```. However, what happens if the pattern is not written out explicitly? I.e. let's say that ```pattern``` is actually being composed out of a csv file, so the 'r' (raw string) prefix cannot be used. Any suggestions how I can make it work in that case? Thanks! :-) – Chadee Fouad May 24 '20 at 06:34
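
On the last two points in the comments, here is a rough sketch rather than a definitive answer. The `r` prefix only affects how Python reads string literals in source code; text read from a file already contains its backslashes as ordinary characters. The `keywords` list is a hypothetical stand-in for values loaded from a CSV, `clean` is an illustrative helper name, and the sketch assumes every keyword consists of word characters, since `\b` behaves differently next to punctuation.

```python
import re

# Hypothetical keyword list, standing in for values read from a CSV file.
keywords = ["LLC", "CO", "CORP", "DIV"]

# An escaped backslash and a raw string give the same two characters;
# an unescaped '\b' is a single backspace character instead.
assert "\\b" == r"\b" and len(r"\b") == 2
assert "\b" == "\x08"

# re.escape guards against keywords that contain regex metacharacters.
alternation = "|".join(re.escape(k) for k in keywords)
pattern = re.compile(r"\b(?:" + alternation + r")\b")

def clean(text):
    """Remove the keywords, then collapse whitespace runs and trim the ends."""
    without_keywords = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", without_keywords).strip()

print(clean(" COMPUTING DEVICES CO DIV LLC CLDC "))  # COMPUTING DEVICES CLDC
```

Collapsing whitespace in a second pass keeps the keyword pattern simple and avoids a loop over double spaces.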
