2

I am trying to replace the Nth appearance of a needle in a haystack. I want to do this simply via re.sub(), but cannot seem to come up with an appropriate regex. I am trying to adapt http://docstore.mik.ua/orelly/perl/cookbook/ch06_06.htm but I suppose I am failing at spanning multiple lines.

My current method is an iterative approach that finds the position of each occurrence from the beginning after each mutation. This is pretty inefficient and I would like to get some input. Thanks!

Brandon Lorenz

6 Answers

3

I think you mean re.sub. You could pass a function and keep track of how often it was called so far:

def replaceNthWith(n, replacement):
    def replace(match, c=[0]):
        c[0] += 1
        return replacement if c[0] == n else match.group(0)
    return replace

Usage:

re.sub(pattern, replaceNthWith(n, replacement), str)

But this approach feels a bit hacky, maybe there are more elegant ways.
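A minimal sketch of the same counting idea without the mutable default argument, assuming Python 3 (for `nonlocal`). The name `replace_nth` and its argument order are hypothetical, not part of the `re` module:

```python
import re

def replace_nth(pattern, replacement, text, n, flags=0):
    """Replace only the nth (1-based) match of pattern in text."""
    seen = 0  # a fresh counter for every call, so nothing leaks between calls

    def callback(match):
        nonlocal seen
        seen += 1
        # Every match passes through here; all but the nth are put back unchanged.
        return replacement if seen == n else match.group(0)

    return re.sub(pattern, callback, text, flags=flags)

print(replace_nth(r"o", "0", "foo boo", 3))
```

As in the original, the replacement is inserted literally, and every match is still visited even after the nth one has been replaced.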

Felix Kling
  • I looked at re.sub, but it didn't appear to have a way of replacing the Nth occurrence, only all, or the first X occurrences. So instead of making it work how I want, I thought it was simpler and clearer to take (to me) the obvious steps using findall/start&end etc. – Matt Warren Aug 24 '11 at 21:53
  • @Matt: You are right, it does not have such a way built-in. With a function though you can get the desired effect. It might not be efficient though, as it actually replaces every occurrence (mostly with itself). – Felix Kling Aug 24 '11 at 21:55
2

Something like this regex should help you. Though I'm not sure how efficient it is:

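import re

# N = 3: skip the first two occurrences, then replace the third.
re.sub(
  r'^((?:.*?mytexttoreplace){2}.*?)mytexttoreplace',
  r'\1yourreplacementtext.',  # raw string, so \1 is a backreference and not chr(1)
  'mystring',
  flags=re.DOTALL
)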

The DOTALL flag is important: without it, . does not match newlines, so the pattern cannot span multiple lines.
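The same pattern can be built for an arbitrary N. This is only a sketch for a literal (non-regex) needle; the helper name `replace_nth_literal` is hypothetical, and a function is used as the replacement so that backslashes in the replacement text are not interpreted:

```python
import re

def replace_nth_literal(haystack, needle, replacement, n):
    """Replace the nth (1-based) occurrence of a literal needle."""
    esc = re.escape(needle)  # treat the needle as plain text, not as a regex
    # Skip n - 1 occurrences lazily, capture everything before the nth one.
    pattern = r'^((?:.*?' + esc + r'){' + str(n - 1) + r'}.*?)' + esc
    return re.sub(
        pattern,
        lambda m: m.group(1) + replacement,  # keep the prefix, swap the needle
        haystack,
        count=1,
        flags=re.DOTALL,  # let . cross line boundaries
    )

print(replace_nth_literal("a-b-c-d", "-", "+", 2))
```

If the haystack contains fewer than n occurrences, the pattern does not match and the string comes back unchanged.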

Jacob Eggers
1

I've been struggling for a while with this, but I found a solution that I think is pretty pythonic:

>>> import re
>>> def nth_matcher(n, replacement):
...     def alternate(n):
...         # Infinite generator: yields True on every nth step, False otherwise.
...         i = 0
...         while True:
...             i += 1
...             yield i % n == 0
...     gen = alternate(n)
...     def match(m):
...         # next(gen) works on Python 2.6+ and 3; gen.next() is Python 2 only.
...         replace = next(gen)
...         if replace:
...             return replacement
...         else:
...             return m.group(0)
...     return match
...
>>> re.sub("([0-9])", nth_matcher(3, "X"), "1234567890")
'12X45X78X0'

EDIT: the matcher consists of two parts:

  1. The alternate(n) function. This returns a generator that yields an infinite sequence of True/False values, where every nth value is True. For n = 3 the sequence starts False, False, True, False, False, True, False, ...

  2. The match(m) function. This is the function that gets passed to re.sub: it gets the next value from alternate(n) (via next(gen)) and, if it's True, replaces the matched value; otherwise it keeps it unchanged (replaces it with itself).
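Because the generator never ends, calling list() on it directly would not terminate. A small sketch using itertools.islice shows the first few values instead; `alternate` is copied from the answer above and `first_six` is just an illustrative name:

```python
from itertools import islice

def alternate(n):
    # Same generator as in the answer: True on every nth step.
    i = 0
    while True:
        i += 1
        yield i % n == 0

# Take only the first six values of the infinite sequence.
first_six = list(islice(alternate(3), 6))
print(first_six)  # [False, False, True, False, False, True]
```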

I hope this is clear enough. If my explanation is hazy, please say so and I'll improve it.

Gabi Purcaru
  • This is interesting, although I am not quite sure how it is working. I can see that from the result it replaces every third occurrence until the end of the haystack. If I could understand the details of how this was working, I could add constraints to end after it successfully replaces an occurrence. Would it be possible to add some explanation? This looks like it could be the best answer. – Brandon Lorenz Aug 25 '11 at 13:16
  • I'd like to tie in http://stackoverflow.com/questions/231767/the-python-yield-keyword-explained to this answer. It explains generators very well. I have a solid understanding of how this works thanks to your explanation and the other SO question. Also, is this closure in action? – Brandon Lorenz Aug 25 '11 at 17:00
  • @bdilly yes. The `gen` variable is the trick: it's used inside `match`, but it is initialized in the outer scope – Gabi Purcaru Aug 25 '11 at 18:27
0

If the pattern ("needle") or the replacement is a complex regular expression, you can't assume anything about what a match or its replacement will look like. The function "nth_occurrence_sub" is what I came up with as a more general solution:

import re

def nth_match_end(pattern, string, n, flags):
    # End index of the nth match (1-based); None if there are fewer than n matches.
    for i, match_object in enumerate(re.finditer(pattern, string, flags)):
        if i + 1 == n:
            return match_object.end()


def nth_occurrence_sub(pattern, repl, string, n=0, flags=0):
    max_n = len(re.findall(pattern, string, flags))
    if abs(n) > max_n or n == 0:
        return string
    if n < 0:
        # A negative n counts backwards from the last match.
        n = max_n + n + 1
    # count and flags are passed by keyword; positional use is deprecated in recent Python versions.
    sub_n_times = re.sub(pattern, repl, string, count=n, flags=flags)
    if n == 1:
        return sub_n_times
    nm1_end = nth_match_end(pattern, string, n - 1, flags)
    sub_nm1_times = re.sub(pattern, repl, string, count=n - 1, flags=flags)
    # Prefix of the substituted string that covers the first n - 1 replacements.
    sub_nm1_change = sub_nm1_times[:-1 * len(string[nm1_end:])]
    components = [
        string[:nm1_end],
        sub_n_times[len(sub_nm1_change):]
        ]
    return ''.join(components)

Ted Striker
0

I have a similar function I wrote to do this. I was trying to replicate SQL REGEXP_REPLACE() functionality. I ended up with:

import re

def sql_regexp_replace(txt, pattern, replacement='', position=1, occurrence=0, regexp_modifier='c'):
    class ReplWrapper(object):
        def __init__(self, replacement, occurrence):
            self.count = 0
            self.replacement = replacement
            self.occurrence = occurrence
        def repl(self, match):
            self.count += 1
            # occurrence == 0 means "replace every match".
            if self.occurrence == 0 or self.occurrence == self.count:
                # expand() resolves \1-style backreferences in the template.
                return match.expand(self.replacement)
            # Any other match is put back unchanged.
            return match.group(0)
    occurrence = 0 if occurrence < 0 else occurrence
    # regexp_flags() is my own helper and is not shown here.
    flags = regexp_flags(regexp_modifier)
    rx = re.compile(pattern, flags)
    replw = ReplWrapper(replacement, occurrence)
    # Only the text from `position` (1-based) onwards is searched.
    return txt[0:position-1] + rx.sub(replw.repl, txt[position-1:])

One important note that I haven't seen mentioned: the replacement function needs to return match.expand(replacement). Otherwise backreferences such as \1 are not expanded and end up in the result as literal text.
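A small sketch of that difference; the function names `literal` and `expanded` are only for illustration:

```python
import re

def literal(match):
    # The returned string is used as-is, so \1 stays in the output.
    return r"<\1>"

def expanded(match):
    # expand() substitutes the captured group into the template.
    return match.expand(r"<\1>")

literal_result = re.sub(r"(\d)", literal, "a1")
expanded_result = re.sub(r"(\d)", expanded, "a1")
print(literal_result)   # a<\1>
print(expanded_result)  # a<1>
```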

To run this as-is you need to supply regexp_flags() yourself (or take it from my GitHub; it's simple to implement). For a quick test you can skip my call to regexp_flags() and just set flags to 0.

woot
0

Could you do it using re.finditer with MatchObject.start() and MatchObject.end()? (re.findall returns only strings, so it cannot give you the positions.)

Find all occurrences of the pattern in the string with re.finditer, get the indices of the Nth occurrence with .start()/.end(), then build a new string with the replacement value using those indices.
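The steps above can be sketched roughly as follows. The function name `replace_nth_by_span` is hypothetical, and the use of match.expand() for the replacement is an assumption so that \1-style templates keep working:

```python
import re
from itertools import islice

def replace_nth_by_span(pattern, replacement, text, n, flags=0):
    """Replace the nth (1-based) match by slicing around its span."""
    if n < 1:
        return text
    matches = re.finditer(pattern, text, flags)
    # Advance lazily to the nth match; stop scanning as soon as it is found.
    match = next(islice(matches, n - 1, None), None)
    if match is None:
        return text  # fewer than n matches
    return text[:match.start()] + match.expand(replacement) + text[match.end():]

print(replace_nth_by_span(r"\d", "X", "1234567890", 3))
```

Unlike the callback approaches, this stops at the nth match instead of visiting every occurrence in the string.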

Matt Warren