Can someone modify this regex to remove the repeated tags, as in the example below?

My current pattern, (<.+?\/>)(?=\1), does not work on the input below, which has extra in it:

<text><text>extra<words><text><words><something>

Should turn into:

<text>extra<words><something>

Thanks

u_ser__
  • What's your logic? Do you want to drop all but the first occurrence of each tag? And, very important, which language do you want to use this pattern in? – Martin Ender Aug 20 '13 at 18:54
  • I would use the regex to match a pattern (the purpose of regex), then add it to an array if the array does not already contain the match. Then I would just implode the array for the output. There are probably other ways, but I think with any method, regex is a component of the solution, not the solution. – twinlakes Aug 20 '13 at 18:57

1 Answer

This is what I've come up with using lookbehinds and back references:

(<[^>]+>)(?<=\1.*\1)

This will match any instance of <tag> which is preceded by at least one other instance of the same <tag>.

For example, to use this in C#:

var input = "<text><text>extra<words><text><words><something>";
var output = Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
Console.WriteLine(output); // <text>extra<words><something>

However, this will not work in many flavors of regex, because it relies on a variable-length lookbehind. JavaScript, for example, did not support lookbehinds at all until ES2018.
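
Where that kind of lookbehind is not available, one possible workaround is to keep the pattern simple and move the "has this tag been seen before?" check into code, roughly along the lines of twinlakes' comment. The following is only a sketch, assuming tags are compared by their exact text; the names sample, seen, and deduplicated are illustrative and not part of the original answer:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var sample = "<text><text>extra<words><text><words><something>";

// Tags encountered so far, compared by their exact text (an assumption).
var seen = new HashSet<string>();

// HashSet.Add returns false when the tag is already present, so the first
// occurrence is kept and every later occurrence is replaced with "".
var deduplicated = Regex.Replace(
    sample,
    @"<[^>]+>",
    m => seen.Add(m.Value) ? m.Value : "");

Console.WriteLine(deduplicated); // <text>extra<words><something>
```

The pattern here uses no lookbehind or backreference, so the same idea should carry over to any language whose replace function accepts a callback.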

p.s.w.g