Remove Repeated Text

Question

Can someone modify this regex to remove repeated words, as in the example below?

My current pattern does not work when other text, such as "extra", appears in the input: (<.+?\/>)(?=\1)

<text><text>extra<words><text><words><something>

Should turn into:

<text>extra<words><something>

Thanks

What's your logic? Do you want to drop all but the first occurrence of each `<tag>`? And, very important, which language do you want to use this pattern in? — Martin Ender, Aug 20 '13 at 18:54
I would use the regex to match a pattern (the purpose of regex), then add it to an array if the array does not already contain the match. Then I would just implode the array for the output. There are probably other ways, but I think with any method, regex is a component of the solution, not the solution. — twinlakes, Aug 20 '13 at 18:57

score 1 · Accepted Answer · answered Aug 20 '13 at 19:02

This is what I've come up with using lookbehinds and back references:

(<[^>]+>)(?<=\1.*\1)

This will match any instance of <tag> which is preceded by at least one other instance of the same <tag>.

For example, to use this in C#:

using System;
using System.Text.RegularExpressions;

var input = "<text><text>extra<words><text><words><something>";
var output = Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
Console.WriteLine(output); // <text>extra<words><something>

However, this will not work in many regex flavors, because it relies on a variable-length lookbehind. JavaScript, for example, did not support lookbehinds at all until ES2018.
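
Where lookbehinds are unavailable, one alternative is the idea suggested in the comments: remember which tags have already been seen and drop the repeats in a replace callback. The following is a minimal sketch of that idea in C#, not the pattern from the answer above. The helper name RemoveRepeatedTags is hypothetical, and the sketch assumes C# 9 or later so that it can run as top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer): removes every <tag>
// that has already appeared earlier in the input, without any lookbehind.
static string RemoveRepeatedTags(string input)
{
    var seen = new HashSet<string>();
    // HashSet<T>.Add returns false when the tag was already recorded,
    // so first occurrences are kept and repeats become the empty string.
    return Regex.Replace(input, @"<[^>]+>", m => seen.Add(m.Value) ? m.Value : "");
}

var sample = "<text><text>extra<words><text><words><something>";
Console.WriteLine(RemoveRepeatedTags(sample)); // <text>extra<words><something>
```

Because the pattern itself is just <[^>]+>, the same approach should carry over to any language that offers a replace-with-callback function, at the cost of keeping the "seen" state outside the regex.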
