Regex: only replace non-nested matches

Question

Given text such as:

This is my [position].
Here are some items:
[items]
    [item]
         Position within the item: [position]
    [/item]
[/items]

Once again, my [position].

I need to match the first and last [position], but not the [position] within [items]...[/items]. Is this doable with a regular expression? So far, all I have is:

Regex.Replace(input, @"\[position\]", "replacement value")

But that is replacing more than I want.

That's not HTML, but it's close enough to reference the obligatory post about parsing HTML with a regular expression anyway. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Wug, Aug 23 '12 at 18:12
Parse the text word by word. If any [position] is found inside a nested element (you have to maintain a flag for this), ignore it; for the others, replace the data. This algorithm is quite simple to write. – Shiplu Mokaddim, Aug 23 '12 at 18:16
@Wug I disagree, since OP wants to *exclude* all the [item]...[/item] bits from search. — Qnan, Aug 23 '12 at 18:16
It can be done using a regex, but it is probably not the best way to do it. — Qnan, Aug 23 '12 at 18:21
@shiplu.mokadd.im That may be my last resort, but I'm hoping to find something more efficient. — chrisofspades, Aug 23 '12 at 18:22
@chrisofspades The time you waste finding a regex is much more than the time you need to write the code. Besides, regex is inefficient for large inputs. – Shiplu Mokaddim, Aug 23 '12 at 18:26
@Wug html is a totally different beast than this due to the many contexts in which it is used and in which it can be invalid but still *work*. — JayC, Aug 23 '12 at 18:27
I posted my comment because you have to handle intentional exclusion of items based on other nested tags, which is a counting problem. Regular expressions can't count, or at least they can't count well. — Wug, Aug 23 '12 at 19:01

score 2 · Accepted Answer · answered Aug 23 '12 at 20:27

As Wug mentioned, regular expressions aren't great at counting. An easier option would be to just find the locations of all of the tokens you're looking for, and then iterate over them and construct your output accordingly. Perhaps something like this:

// Requires: using System.Text; using System.Text.RegularExpressions;
public string Replace(string input, string replacement)
{
    // find all the tags ([items] is included so the outer block also counts as nesting)
    var regex = new Regex(@"\[(?:position|/?items?)\]");
    var matches = regex.Matches(input);

    // loop through the tags and build up the output string
    var builder = new StringBuilder();
    int lastIndex = 0;
    int nestingLevel = 0;
    foreach (Match match in matches)
    {
        // append everything between the previous tag and this one
        builder.Append(input.Substring(lastIndex, match.Index - lastIndex));

        switch (match.Value)
        {
            case "[items]":
            case "[item]":
                nestingLevel++;
                builder.Append(match.Value);
                break;
            case "[/items]":
            case "[/item]":
                nestingLevel--;
                builder.Append(match.Value);
                break;
            case "[position]":
                // Append the replacement text if we're outside of any [items]/[item] blocks
                // Otherwise append the tag unchanged
                builder.Append(nestingLevel == 0 ? replacement : match.Value);
                break;
        }
        lastIndex = match.Index + match.Length;
    }

    // append whatever follows the last tag
    builder.Append(input.Substring(lastIndex));
    return builder.ToString();
}

(Disclaimer: Have not tested. Or even attempted to compile. Apologies in advance for inevitable bugs.)

I was thinking about a similar approach myself, based on @shiplu.mokadd.im's comment above (http://stackoverflow.com/questions/12097672/regex-only-replace-non-nested-matches/12097776#comment16167387_12097672). This may be the best solution since a pure Regex approach doesn't seem viable. — chrisofspades, Aug 23 '12 at 20:43

score 0 · Answer 2 · Phillip Schmidt · 2012-08-23T18:35:34.183

You could maaaaaybe get away with:

Regex.Replace(input,@"(?=\[position\])(!(\[item\].+\[position\].+\[/item\]))","replacement value");

I dunno, I hate ones like this. But this is a job for XML parsing, not regex. If your brackets are really brackets, just search and replace them with angle brackets, then XML parse.

edited Aug 23 '12 at 18:35

answered Aug 23 '12 at 18:18

Phillip Schmidt


It would fail an XML parser since there is no root node, and tons of unclosed "tags". I tried that pattern in Expresso and it didn't work. – chrisofspades Aug 23 '12 at 18:28
Use of a literal string would make this more readable. i.e. `@"(?=\[position\])(!(\[item\].+\[position\]\[/item\]))"` – Wug Aug 23 '12 at 18:29
@wug yeah, I would have used it in my code if it was for me, but I did it this way because he did it that way – Phillip Schmidt Aug 23 '12 at 18:30
@PhillipSchmidt I do see one small problem in your pattern here `\[position\]\[/item\]`, which should probably read `\[position\].+\[/item\]`. Even with that modification it still didn't work. – chrisofspades Aug 23 '12 at 18:37
@chrisofspades hang on, I'm actually testing it out now :P – Phillip Schmidt Aug 23 '12 at 18:39
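
For reference, the `(!(...))` group in the pattern above matches a literal `!` character; a negative lookahead in .NET is written `(?!...)`. Below is a hedged sketch of what a lookahead-based version might look like. It is not the answerer's pattern. The class and method names are hypothetical, and it assumes every `[items]` block is closed and that `[items]` blocks are never nested inside one another.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadSketch
{
    // Hypothetical helper, not part of the original answer.
    public static string ReplaceOutsideItems(string input, string replacement)
    {
        // Match [position] only when it is NOT followed by a [/items]
        // that can be reached without first passing an opening [items].
        // If such a [/items] exists, we must be inside an [items] block.
        // Assumption: [items] blocks are balanced and not nested in each other.
        const string pattern = @"\[position\](?!(?:(?!\[items\]).)*\[/items\])";

        // Singleline lets '.' cross line breaks, since blocks span several lines.
        return Regex.Replace(input, pattern, replacement, RegexOptions.Singleline);
    }
}
```

Note that with this overload the replacement string is subject to `$` substitutions, so a replacement containing a literal `$` would need escaping. The lookahead also rescans the rest of the text for each `[position]`, which may matter on large inputs.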

score 0 · Answer 3 · answered Aug 23 '12 at 20:02

What if you do it in two passes? Like,

s1 = Regex.Replace(input, @"(\[items\])(\w|\W)*(\[\/items\])", "")

This will give you:

This is my [position].
Here are some items:
Once again, my [position].

As you can see, the items section is removed. Then on s1 you can replace your desired positions. Like,

s2 = Regex.Replace(s1, @"\[position\]", "replacement_value")

This might not be the best solution. I tried very hard to solve it with a single regex but was not successful.

Interesting suggestion, but I still need to retain the content from `[items]...[/items]`. — chrisofspades, Aug 23 '12 at 20:41
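
One possible way to address the concern in the comment above is to match the whole `[items]...[/items]` block and put it back unchanged, instead of deleting it. The sketch below does that with a single `Regex.Replace` call and a `MatchEvaluator`. The class and method names are hypothetical, and it assumes `[items]` blocks are balanced and not nested inside one another.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkipBlocksSketch
{
    // Hypothetical helper, not part of the original answer.
    public static string ReplaceOutsideItems(string input, string replacement)
    {
        // First alternative: consume an entire [items]...[/items] block (lazily),
        // so any [position] inside it is never seen on its own.
        // Second alternative: a bare [position] outside such a block.
        // Assumption: [items] blocks are balanced and not nested in each other.
        const string pattern = @"\[items\].*?\[/items\]|\[position\]";

        return Regex.Replace(
            input,
            pattern,
            // Keep whole blocks as they are; swap only the bare [position] tags.
            m => m.Value == "[position]" ? replacement : m.Value,
            RegexOptions.Singleline);
    }
}
```

Because the evaluator returns the replacement as a plain string, `$` characters in it are not treated as substitution patterns. The text is scanned once from left to right, so the content of each `[items]` block is retained exactly as written.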
