3

Given some string:

"this is a test string ababababa"

And some "stop" words to remove:

"rin" "hi" "tri" "ababa"

The resulting string becomes

"ts is a test sg "

So I basically want to do the equivalent of the following (efficiently with a regex):

  1. For each stop word, slide it across the main string, one character at a time. Any time there's a pattern match, I "mark" the corresponding letters for removal later.

  2. After all stop words have been processed, remove any letters from the main string that are marked.

user262329
  • 33
  • 4
  • Made into regex question instead – user262329 Aug 16 '15 at 05:08
  • Check the functionality of re.findall, it comes close, but may not be what you describe w.r.t. overlapping matches – Paddy3118 Aug 16 '15 at 05:17
  • The overlapping matches is the main reason I made the question (I had already tried re.findall although unsuccessfully) – user262329 Aug 16 '15 at 05:18
  • I'm not sure there's any reasonable way to do this except "by hand" (i.e. using the `pos` argument to [`match()`](https://docs.python.org/2/library/re.html#re.RegexObject.match)). – Kevin Aug 16 '15 at 05:33
  • @Kevin Can this code be modified to do overlapping? http://stackoverflow.com/a/6117124/5231588 – user262329 Aug 16 '15 at 05:47
  • @user262329: Not easily. `sub()` finds non-overlapping matches. You would have to totally re-implement it. – Kevin Aug 16 '15 at 05:49
  • I had looked up the flags for sub() and couldn't find anything either, darn. – user262329 Aug 16 '15 at 05:53
  • That sounds more complicated than just calling `match()` in a loop with progressively larger `pos` values (or doing repeated string slicing, but YMMV on performance in that case). If you figure out a DFA/NFA solution, be sure to answer your question with it, so the rest of us can read about it. – Kevin Aug 16 '15 at 06:08

1 Answers1

2

Assuming you have

orig_str = 'this is a test string ababababa'
stopwords = ['rin', 'hi', 'tri', 'ababa']

I think this should do what you want. The main trick is to use the non-consuming matching syntax (?=...) to find all the overlapping matches:

# Update: sort stop words so that words prefixing 
# other stop words don't cause issues
stopwords.sort(key=lambda x: len(x), reverse=True)
pattern = re.compile('|'.join('(?=(%s))' % word for word in stopwords))
matches = pattern.finditer(orig_str)

Once you have the matches, you can do something like this to remove matching characters. Not sure it's the most efficient, but it works:

mask = [True] * len(orig_str) # True means keep this char
for match in matches:
    # mark all matches' positions in the mask
    wordlen = max(len(word) for word in match.groups() if word)
    mask[match.start():match.start()+wordlen] = [False] * wordlen

new_str = "".join(c for (c,i) in zip(orig_str, mask) if i)
# new_str:
# 'ts is a test sg '
lemonhead
  • 5,328
  • 1
  • 13
  • 25
  • Damn, I forgot about lookahead. Yeah, this is probably the right solution. – Kevin Aug 16 '15 at 06:09
  • This unfortunately doesn't work in all overlap cases. Try orig_str = "fullword" with stopwords ["full","fullword"] -- it only removes "full" when it should be removing the whole thing. It seems like the order of the stopwords matters (whereas ideally it shouldn't) – user262329 Aug 16 '15 at 06:26
  • Ah yes, good point. Issue is when one stop word is a prefix of another, the `|` will always match the first and then short circuit. I worked around this by sorting the stopwords by length, so that the earlier precondition in the pattern are always longer or equal length. Updated answer – lemonhead Aug 16 '15 at 06:48