Python regex for quickly removing any possible instances of substrings?

Question

Given some string:

"this is a test string ababababa"

And some "stop" words to remove:

"rin" "hi" "tri" "ababa"

The resulting string becomes

"ts is a test sg "

So I basically want to do the equivalent of the following (efficiently with a regex):

For each stop word, slide it across the main string, one character at a time. Any time there's a pattern match, I "mark" the corresponding letters for removal later.
After all stop words have been processed, remove any letters from the main string that are marked.

Check the functionality of re.findall, it comes close, but may not be what you describe w.r.t. overlapping matches — Paddy3118, Aug 16 '15 at 05:17
The overlapping matches is the main reason I made the question (I had already tried re.findall although unsuccessfully) — user262329, Aug 16 '15 at 05:18
I'm not sure there's any reasonable way to do this except "by hand" (i.e. using the `pos` argument to [`match()`](https://docs.python.org/2/library/re.html#re.RegexObject.match)). — Kevin, Aug 16 '15 at 05:33
@Kevin Can this code be modified to do overlapping? http://stackoverflow.com/a/6117124/5231588 — user262329, Aug 16 '15 at 05:47
@user262329: Not easily. `sub()` finds non-overlapping matches. You would have to totally re-implement it. — Kevin, Aug 16 '15 at 05:49
I had looked up the flags for sub() and couldn't find anything either, darn. — user262329, Aug 16 '15 at 05:53
That sounds more complicated than just calling `match()` in a loop with progressively larger `pos` values (or doing repeated string slicing, but YMMV on performance in that case). If you figure out a DFA/NFA solution, be sure to answer your question with it, so the rest of us can read about it. — Kevin, Aug 16 '15 at 06:08

lemonhead · Accepted Answer · 2015-08-16T06:48:08.907

2

Assuming you have

orig_str = 'this is a test string ababababa'
stopwords = ['rin', 'hi', 'tri', 'ababa']

I think this should do what you want. The main trick is to use the non-consuming matching syntax (?=...) to find all the overlapping matches:

# Update: sort stop words so that words prefixing 
# other stop words don't cause issues
stopwords.sort(key=lambda x: len(x), reverse=True)
pattern = re.compile('|'.join('(?=(%s))' % word for word in stopwords))
matches = pattern.finditer(orig_str)

Once you have the matches, you can do something like this to remove matching characters. Not sure it's the most efficient, but it works:

mask = [True] * len(orig_str) # True means keep this char
for match in matches:
    # mark all matches' positions in the mask
    wordlen = max(len(word) for word in match.groups() if word)
    mask[match.start():match.start()+wordlen] = [False] * wordlen

new_str = "".join(c for (c,i) in zip(orig_str, mask) if i)
# new_str:
# 'ts is a test sg '

edited Aug 16 '15 at 06:48

answered Aug 16 '15 at 06:02

lemonhead

5,328
1
13
25

Damn, I forgot about lookahead. Yeah, this is probably the right solution. – Kevin Aug 16 '15 at 06:09
This unfortunately doesn't work in all overlap cases. Try orig_str = "fullword" with stopwords ["full","fullword"] -- it only removes "full" when it should be removing the whole thing. It seems like the order of the stopwords matters (whereas ideally it shouldn't) – user262329 Aug 16 '15 at 06:26
Ah yes, good point. Issue is when one stop word is a prefix of another, the `|` will always match the first and then short circuit. I worked around this by sorting the stopwords by length, so that the earlier precondition in the pattern are always longer or equal length. Updated answer – lemonhead Aug 16 '15 at 06:48

Python regex for quickly removing any possible instances of substrings?

1 Answers1