
As an introductory note, I am aware of the old saying about solving problems with regex, and I am also aware of the warnings against processing XML with RegEx. But please bear with me for a moment...

I am trying to do a RegEx search and replace on a group of characters. I don't know in advance how often this group will be matched, and I want to search within a certain context only.

An example: If I have the following string "**ab**df**ab**sdf**ab**fdsa**ab**bb" and I want to search for "ab" and replace with "@ab@", this works fine using the following regex:

Search regex:

(.*?)(ab)(.*?)

Replace:

$1@$2@$3

I get four matches in total, as expected. Within each match, the group IDs are the same, so the back-references ($1, $2 ...) work fine, too.
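
For reference, the search and replace above can be reproduced with a minimal sketch like the following. It assumes a Java-style regex engine (the `$1` replacement syntax fits several flavors); the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: applies the question's pattern and replacement.
final class PlainReplaceDemo {
    // Pattern and replacement string taken verbatim from the question.
    static String mark(String input) {
        return Pattern.compile("(.*?)(ab)(.*?)")
                .matcher(input)
                .replaceAll("$1@$2@$3");
    }

    public static void main(String[] args) {
        // Each of the four "ab" occurrences is wrapped in "@"; the trailing "bb" is untouched.
        System.out.println(mark("abdfabsdfabfdsaabbb"));
    }
}
```

Because both `(.*?)` groups are lazy, the trailing group always matches the empty string here, so each match ends right after `ab`.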

However, if I now add a certain context to the string, the regex above fails:

Search string:

<context>abdfabsdfabfdsaabbb</context>

Search regex:

<context>(.*?)(ab)(.*?)</context>

This will find only the first match. But even if I add a non-capturing group to the original regex, it doesn't work ("<context>(?:(.*?)(ab)(.*?))*</context>").
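
A small sketch of why only one match comes back, again assuming a Java-style engine and using hypothetical names: the literal `<context>` and `</context>` occur once each in the input, so the whole pattern can match only once, and the last lazy group ends up swallowing every remaining `ab`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows the single match produced by the wrapped pattern.
final class ContextSingleMatchDemo {
    static final Pattern WRAPPED =
            Pattern.compile("<context>(.*?)(ab)(.*?)</context>");

    // Counts how many times the wrapped pattern matches the input.
    static int countMatches(String input) {
        Matcher m = WRAPPED.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Returns group 3 of the first match, or null if there is no match.
    static String trailingGroup(String input) {
        Matcher m = WRAPPED.matcher(input);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String input = "<context>abdfabsdfabfdsaabbb</context>";
        System.out.println(countMatches(input));   // number of matches
        System.out.println(trailingGroup(input));  // text absorbed by the last group
    }
}
```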

What I would like is a list of matches as in the first search (without the context), whereby within each match the group IDs are the same.

Any idea how this could be achieved?

nhahtdh
marw
  • I edited the post to correct the formatting for code. Please double check that it is showing correctly. – nhahtdh Jan 29 '14 at 10:43
  • You might want to check this out: http://stackoverflow.com/a/14899550/1400768 – nhahtdh Jan 29 '14 at 10:47
  • I've re-read your question several times, and am really confused about what you're actually asking! Can you perhaps show some more context of what problem you're trying to solve? Are you just trying to replace matched characters in a string, within `<context>`? – Tom Lord Jan 29 '14 at 11:42
  • @TomLord, your interpretation is correct. I want to replace a certain string of characters within the `<context>` tag. This string of characters may occur zero or more times and I don't know in advance how many matches there will be. – marw Jan 29 '14 at 11:51
  • @nhahtdh, thanks for editing the post. It looks much better now. :) I have read the post you linked to, but I don't see how the use of the \G flag would relate to my problem. – marw Jan 29 '14 at 11:51
  • @marw: Sorry for throwing the post at you without more explanation (since the post is not directly related to your problem). Basically, using a single regex to get more than 1 match inside a tag is impossible without `\G`, since you can't make sure that the engine makes a match only inside the tag you want. There are 2 solutions: use 2 regex: one to take content inside tag and second to match the text you want; OR use a single (complex) regex with `\G` to assert that you are still inside a tag and match items inside it. – nhahtdh Jan 29 '14 at 12:28
  • Thanks a lot, nhahtdh, for the clarification. I understand better now how the \G is supposed to work and how it relates to my problem. However, I am not able to get it to work. E.g. what would be the correct RegEx to pull out all "ab" matches (3x) in the string "fdasfabfdasdfabafab"? The following works without the `<context>` tag: "(.*?)(ab)(.*?)" – marw Jan 29 '14 at 13:50
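
The first option mentioned in the comments (two regexes: one to isolate the tag content, one to match inside it) might look roughly like this. It is only a sketch under assumptions: a Java 9+ runtime for `Matcher.replaceAll` with a function argument, tags that contain plain text only, and hypothetical class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: two-step replacement restricted to <context> content.
final class TwoStepReplaceDemo {
    // Step 1: isolate the opening tag, the tag content, and the closing tag.
    static final Pattern CONTEXT =
            Pattern.compile("(?s)(<context>)(.*?)(</context>)");
    // Step 2: the text to mark inside the isolated content.
    static final Pattern TARGET = Pattern.compile("ab");

    static String mark(String input) {
        return CONTEXT.matcher(input).replaceAll(mr -> {
            // Replace only inside the captured tag content.
            String body = TARGET.matcher(mr.group(2)).replaceAll("@$0@");
            // Quote the result so "$" or "\" in the text is not treated as a group reference.
            return Matcher.quoteReplacement(mr.group(1) + body + mr.group(3));
        });
    }

    public static void main(String[] args) {
        // The "ab" before and after the tag is left alone.
        System.out.println(mark("ab<context>abdfab</context>ab"));
    }
}
```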

1 Answer

Solution

Your requirement is similar to the one in this question: match and capture multiple instances of a pattern between a prefix and a suffix. Using the method as described in this answer of mine:

(?s)(?:<context>|(?!^)\G)(?:(?!</context>|ab).)*ab

Add capturing group as you need.

Caveat

Note that the regex only works for tags that contain only text. If a tag contains other tags, then it won't work correctly.

It also matches ab inside a <context> tag that has no closing </context> tag. To prevent this, add a lookahead that requires the closing tag:

(?s)(?:<context>(?=.*?</context>)|(?!^)\G)(?:(?!</context>|ab).)*ab
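
To see the difference between the two patterns on an unclosed tag, here is a hedged sketch for a Java-style engine; the class name, constant names, and sample inputs are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compares the two patterns from the answer.
final class UnclosedTagDemo {
    // Pattern without the closing-tag check.
    static final Pattern LENIENT = Pattern.compile(
            "(?s)(?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*ab");
    // Pattern that only enters a <context> tag if a closing tag follows somewhere.
    static final Pattern STRICT = Pattern.compile(
            "(?s)(?:<context>(?=.*?</context>)|(?!^)\\G)(?:(?!</context>|ab).)*ab");

    // Counts the matches of the given pattern in the input.
    static int count(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String unclosed = "<context>abxab";
        System.out.println(count(LENIENT, unclosed)); // matches inside the unclosed tag
        System.out.println(count(STRICT, unclosed));  // refuses to enter the unclosed tag
    }
}
```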

Explanation

Let us break down the regex:

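(?s)                        # Make . match any character, including line terminators
(?:
  <context>                 # Either enter a new <context> tag
    |
  (?!^)\G                   # or resume where the previous match ended (not at the start of input)
)
(?:(?!</context>|ab).)*     # Skip text that is neither ab nor the closing tag
ab                          # The text we actually want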

(?:<context>|(?!^)\G) makes sure that we either enter a new <context> tag, or continue from the end of the previous match and attempt to match another instance of the sub-pattern.

(?:(?!</context>|ab).)* matches whatever text we don't care about (anything that is not ab) and prevents us from going past the closing tag </context>. Then we match the pattern we want, ab, at the end.
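
As an illustration of the behaviour described here, the following sketch collects every match in a Java-style engine (which supports `\G`). The class name, method name, and sample input are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: lists the matches of the \G-based pattern.
final class ContextMatchDemo {
    static final Pattern AB_IN_CONTEXT = Pattern.compile(
            "(?s)(?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*ab");

    // Returns the full text of each match, in order.
    static List<String> matches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = AB_IN_CONTEXT.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The "ab" before the opening tag and after the closing tag is not matched.
        System.out.println(matches("ab<context>abdfabsdf</context>ab"));
    }
}
```

Note that the first match in a tag includes the `<context>` prefix itself, and each later match starts exactly where the previous one ended.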

Community
nhahtdh
  • This is great. Thanks a lot, @nhahtdh. I didn't realize that the \G marker was meant to come after the expression. With the pattern you provide I can now match the required expression. However, I still need to pull out the correct captured groups, so that I can replace the search pattern or insert the original part of the string when the pattern was not found. I have tried marking the last "ab" in the string as a capturing group by using "(ab)", which gets me halfway there, but still leaves the question of how to get the non-modified parts of the string into a capturing group. – marw Jan 30 '14 at 11:45
  • @marw: `(?s)((?:<context>|(?!^)\G)(?:(?!</context>|ab).)*)(ab)`. The first capturing group contains whatever precedes `ab` in the match. – nhahtdh Jan 30 '14 at 14:13
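
Putting the pieces together, a replacement with the two capturing groups could be sketched as below. This assumes the pattern is the answer's regex with group 1 around the skipped text (including the `<context>` prefix on the first match) and group 2 around `ab`, run in a Java-style engine; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: wraps each "ab" inside <context> in "@".
final class ContextReplaceDemo {
    // Group 1: tag prefix or skipped text; group 2: the "ab" to mark.
    static final Pattern GROUPED = Pattern.compile(
            "(?s)((?:<context>|(?!^)\\G)(?:(?!</context>|ab).)*)(ab)");

    static String mark(String input) {
        // Re-emit the untouched text from group 1, then the marked "ab".
        return GROUPED.matcher(input).replaceAll("$1@$2@");
    }

    public static void main(String[] args) {
        // Only the "ab" occurrences between the tags are wrapped.
        System.out.println(mark("ab<context>abdfabsdfabfdsaabbb</context>ab"));
    }
}
```

Text that follows the last `ab` inside the tag is never part of a match, so it is copied through unchanged by the replace operation.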