5

Suppose I had the string "cats cats cats and dogs dogs dogs".

What regular expression would I use to replace that string with "cats and dogs", i.e. to remove the duplicates? The expression must, however, only remove duplicates that directly follow each other. For instance:

"cats cats cats and dogs dogs dogs and cats cats and dogs dogs"

Would return:

"cats and dogs and cats and dogs"

Immanu'el Smith
  • Check out http://stackoverflow.com/questions/1058783/regular-expression-to-find-and-remove-duplicate-words it might give you some pointers on your question. – Jason Evans Jun 10 '10 at 13:21

4 Answers

9
resultString = Regex.Replace(subjectString, @"\b(\w+)(?:\s+\1\b)+", "$1");

will do all replacements in one single call.

Explanation:

\b                 # assert that we are at a word boundary
                   # (we only want to match whole words)
(\w+)              # match one word, capture into backreference #1
(?:                # start of non-capturing, repeating group
   \s+             # match at least one whitespace character
   \1              # match the same word as previously captured
   \b              # as long as we match it completely
)+                 # do this at least once
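
The commented layout above can be handed to .NET almost verbatim. The following is a minimal sketch, assuming `RegexOptions.IgnorePatternWhitespace` is acceptable in your code base; the class name `AdjacentDuplicates` and the helper `Collapse` are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class AdjacentDuplicates
{
    // Same pattern as the one-liner, kept in its commented form.
    // IgnorePatternWhitespace makes the engine skip unescaped whitespace
    // in the pattern and treat '#' as the start of a comment.
    private static readonly Regex Repeated = new Regex(@"
        \b          # assert that we are at a word boundary
        (\w+)       # match one word, capture into backreference 1
        (?:         # start of non-capturing, repeating group
           \s+      # match at least one whitespace character
           \1       # match the same word as previously captured
           \b       # as long as we match it completely
        )+          # do this at least once",
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: collapses each run of a repeated word to one copy.
    public static string Collapse(string text)
    {
        return Repeated.Replace(text, "$1");
    }

    public static void Main()
    {
        Console.WriteLine(Collapse(
            "cats cats cats and dogs dogs dogs and cats cats and dogs dogs"));
        // prints: cats and dogs and cats and dogs
    }
}
```

Because the `+` group consumes the whole run in one match, no loop is needed, and the two `\b` assertions leave input such as "Michael Bolton on CD" untouched.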
Tim Pietzcker
2

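Replace \b(\w+)\s+\1\b with $1

Do this in a loop until no more matches are found. A single replace-all pass is not enough, as it would not remove the third cats in cats cats cats. The word boundaries (\b) matter: without them the pattern also matches inside words, for example the d that ends and followed by the d that starts dogs.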

\1 in regex refers to the contents of the first captured group.
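
To illustrate why one pass is not enough for a pattern that only matches a pair, here is a small sketch comparing a single replace-all call with a loop that runs until the text stops changing. The names `RepeatedReplace`, `SinglePass` and `UntilStable` are hypothetical, and the pattern used is the pair pattern with `\b` on both sides so that it cannot match inside words.

```csharp
using System;
using System.Text.RegularExpressions;

static class RepeatedReplace
{
    // Pair pattern: one word, whitespace, the same word again.
    // The \b anchors restrict it to whole words.
    private const string Pair = @"\b(\w+)\s+\1\b";

    // One call to Regex.Replace: replaces every non-overlapping pair once.
    public static string SinglePass(string text)
    {
        return Regex.Replace(text, Pair, "$1");
    }

    // Repeats the replacement until a pass changes nothing.
    public static string UntilStable(string text)
    {
        string previous;
        do
        {
            previous = text;
            text = Regex.Replace(text, Pair, "$1");
        } while (text != previous);
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(SinglePass("cats cats cats"));   // prints: cats cats
        Console.WriteLine(UntilStable("cats cats cats"));  // prints: cats
    }
}
```

After the first pair is replaced, scanning resumes after the match, so the third word has no partner left in that pass; the loop picks it up on the next pass.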

Try:

str = "cats cats cats and dogs dogs dogs and cats cats and dogs dogs";
str = Regex.Replace(str, @"(\b\w+\b)\s+(\1(\s+|$))+", "$1 ");
Console.WriteLine(str);
Amarghosh
  • I'm using this code: replacer = Regex.Replace(replacer, @"([\\n]+)[\s+]?\1", string.Empty); but it doesn't seem to work. It works in rubular though http://www.rubular.com/r/Ey6wrLYXNw – Immanu'el Smith Jun 10 '10 at 13:42
  • @Emmanuel Try `str = Regex.Replace(str, @"(\w+)\s+\1", "$1");` – Amarghosh Jun 10 '10 at 13:54
1

No doubt there is a smaller regex possible, but this one seems to do the trick:

string somestring = "cats cats cats and dogs dogs dogs and cats cats and dogs dogs";
Regex regex = new Regex(@"(\w+)\s(?:\1\s)*(?:\1(\s|$))");
string result = regex.Replace(somestring, "$1$2");

It also takes into account the last "dogs" not ending with a space.

C.Evenhuis
  • This will remove too many spaces: `cats cats cats and dogs dogs dogs and cats cats and dogs dogs` becomes `catsand dogsand catsand dogs`. It also matches too much: `Michael Bolton on CD` becomes `Michael BoltonCD`. Sorry about the Office Space reference. – Tim Pietzcker Jun 10 '10 at 14:10
  • Weird, I can't seem to reproduce those errors. Perhaps I should add some more pieces of flair :] – C.Evenhuis Jun 10 '10 at 14:23
  • Oops, I missed that you are replacing with `$1$2`, so the first problem I thought I saw is not there. But Michael Bolton still has a problem. Perhaps some hypnosis will help (or a word boundary `\b` before the `\w`). – Tim Pietzcker Jun 10 '10 at 14:28
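
The word-boundary point raised in these comments can be checked with a short sketch. It runs the regex from the answer as posted, and a variant with `\b` prepended as suggested, against the "Michael Bolton on CD" input. The class name `BoundaryCheck` and the helper `Apply` are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class BoundaryCheck
{
    // Pattern as posted in the answer.
    public const string Original = @"(\w+)\s(?:\1\s)*(?:\1(\s|$))";

    // Same pattern with a leading word boundary, as suggested in the comment.
    public const string Anchored = @"\b(\w+)\s(?:\1\s)*(?:\1(\s|$))";

    // Applies the answer's replacement string "$1$2" with the given pattern.
    public static string Apply(string pattern, string text)
    {
        return Regex.Replace(text, pattern, "$1$2");
    }

    public static void Main()
    {
        string text = "Michael Bolton on CD";
        Console.WriteLine(Apply(Original, text)); // prints: Michael Bolton CD
        Console.WriteLine(Apply(Anchored, text)); // prints: Michael Bolton on CD
    }
}
```

Without the boundary, `(\w+)` can start in the middle of "Bolton" and capture its trailing "on", which then pairs with the following word "on"; with `\b` the capture has to begin at the start of a word.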
0

Try the following code.



using System;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    /// <summary>
    /// A description of the regular expression:
    ///
    /// Match expression but don't capture it. [^|\s+]
    ///     Select from 2 alternatives
    ///         Beginning of line or string
    ///         Whitespace, one or more repetitions
    /// [1]: A numbered capture group. [(\w+)(?:\s+|$)]
    ///     (\w+)(?:\s+|$)
    ///         [2]: A numbered capture group. [\w+]
    ///             Alphanumeric, one or more repetitions
    ///         Match expression but don't capture it. [\s+|$]
    ///             Select from 2 alternatives
    ///                 Whitespace, one or more repetitions
    ///                 End of line or string
    /// [3]: A numbered capture group. [\1|\2], one or more repetitions
    ///     Select from 2 alternatives
    ///         Backreference to capture number: 1
    ///         Backreference to capture number: 2
    /// </summary>
    class Class1
    {
        /// <summary>
        /// Main entry point of the application.
        /// </summary>
        static void Main(string[] args)
        {
            // Verbatim string (@"...") so the backslashes reach the regex engine unchanged.
            Regex regex = new Regex(
                @"(?:^|\s+)((\w+)(?:\s+|$))(\1|\2)+",
                RegexOptions.IgnoreCase | RegexOptions.Compiled);
            string str = "cats cats cats and dogs dogs dogs and cats cats and dogs dogs";
            string regexReplace = " $1";

            Console.WriteLine("Before :" + str);
            str = regex.Replace(str, regexReplace);
            Console.WriteLine("After :" + str);
        }
    }
}

Stephan