
I have a string:

aaabbashasccddee

I want to match runs of an even number of consecutive identical characters. For example, from the string above, I want these matches:

[bb],[cc],[dd],[ee]

I have tried this pattern, but it's not even close:

^(..)*$
halfer
user786

2 Answers


Fortunately, .NET regular expressions support unbounded (variable-length) lookbehinds. What you need can be achieved with the following regex:

((?>(?(2)(?=\2))(.)\2)+)(?<!\2\1)(?!\2)

See live demo here

Regex breakdown:

  • ( Start of capturing group #1
    • (?> Start of non-capturing group (atomic)
      • (?(2) If capturing group #2 is set
        • (?=\2) Next character should be it
      • ) End of conditional
      • (.)\2 Match and capture a character and match it again (even number)
    • )+ Repeat as much as possible, at least once
  • ) End of capturing group #1
  • (?<!\2\1) Here is the trick. This lookbehind asserts that what we have matched so far (group #1) is not immediately preceded by one more occurrence of the character stored in capturing group #2, so a match cannot start in the middle of a longer run
  • (?!\2) Next character shouldn't be the same as the character stored in capturing group #2
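
As a sanity check on what the pattern is meant to find, here is a minimal sketch that collects the same runs with a plain loop and compares them with the regex result. The helper name `EvenRuns` is hypothetical and not part of the answer; the regex itself is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (an assumption for illustration): collects every maximal
// run of one repeated character whose length is even.
static List<string> EvenRuns(string s)
{
    var runs = new List<string>();
    int start = 0;
    while (start < s.Length)
    {
        // Advance 'end' to the first position that holds a different character.
        int end = start;
        while (end < s.Length && s[end] == s[start]) end++;

        int length = end - start;
        if (length % 2 == 0) runs.Add(s.Substring(start, length));
        start = end;
    }
    return runs;
}

string input = "aaabbashasccddee";
List<string> viaLoop = EvenRuns(input);
List<string> viaRegex = Regex
    .Matches(input, @"((?>(?(2)(?=\2))(.)\2)+)(?<!\2\1)(?!\2)")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

Console.WriteLine(string.Join(",", viaLoop));      // bb,cc,dd,ee
Console.WriteLine(viaLoop.SequenceEqual(viaRegex)); // compares the loop result with the regex result
```

The loop is only a cross-check; the regex remains the answer to the regex-only requirement.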

UPDATE:

So you can use the following C# code to get all even-length runs of a character in a string with a regex alone (no other operations):

var allEvenSequences = Regex.Matches("aaabbashasccddee", @"((?>(?(2)(?=\2))(.)\2)+)(?<!\2\1)(?!\2)").Cast<Match>().ToList();

If you also want the output formatted as [bb],[cc],[dd],[ee], you can join the matches:

string strEvenSequences = string.Join(",", allEvenSequences.Select(x => $"[{x}]").ToArray());
//strEvenSequences will be [bb],[cc],[dd],[ee]
Aria
revo

Another possible regex-only solution that doesn't involve conditionals:

(.)(?<!\1\1)\1(?:\1\1)*(?!\1)

Breakdown:

(.)         # First capturing group - matches any character.
(?<!\1\1)   # Negative lookbehind - ensures the matched char isn't preceded by the same char.
\1          # Matches one more occurrence of the character captured in group 1 (at least two in total).
(?:\1\1)    # A non-capturing group that matches two occurrences of the same char.
*           # Matches between zero and unlimited times of the previous group.
(?!\1)      # Negative lookahead to make sure no extra occurrence of the char follows.

Demo:

string input = "aaabbashasccddee";
string pattern = @"(.)(?<!\1\1)\1(?:\1\1)*(?!\1)";
var matches = Regex.Matches(input, pattern);
foreach (Match m in matches)
    Console.WriteLine(m.Value);

Output:

bb
cc
dd
ee

Try it online.

  • This is gold. I liked the idea. +1 – revo Mar 16 '19 at 11:37
  • @revo Thanks a lot :-) I was actually working on it a couple minutes after the question was posted but I discarded it once the question got an answer with a working solution (now deleted). Then when the OP asked for a regex-only solution and I saw your answer (which is great, btw), I said "well, let me post my version too" :-D – 41686d6564 stands w. Palestine Mar 16 '19 at 11:42
  • @AhmedAbdelhameed what's the difference between `(?>..)` and `(?<..)` please let me know – user786 Mar 17 '19 at 05:32
  • @Alex `(?>..)` is an [atomic group](https://www.regular-expressions.info/atomic.html) (read more about it [here](https://stackoverflow.com/a/14412277/4934172)) while `(?<..)` is a [Lookbehind](https://www.regular-expressions.info/lookaround.html) which can either be positive (i.e., `(?<=..)`) or negative (i.e., `(?<!..)`). Both atomic group and Lookbehind are used (and named) in revo's answer while only a Lookbehind is used in mine. Tip: check all the links in this comment. They're very useful and you'll learn a lot. Good luck! :-) – 41686d6564 stands w. Palestine Mar 17 '19 at 05:40
  • @AhmedAbdelhameed does a negative lookbehind mean that the last character is not equal to the current character? – user786 Mar 17 '19 at 05:52
  • @Alex A negative Lookbehind means that the _next_ character must _not_ be preceded by what's inside the Lookbehind. Please read the articles I provided to understand more. [Here's another great answer](https://stackoverflow.com/a/2973495/4934172) that explains Lookaheads, Lookbehinds, and atomic groups. – 41686d6564 stands w. Palestine Mar 17 '19 at 05:57
  • @AhmedAbdelhameed how to make it for odd numbered length substrings? – user786 Mar 17 '19 at 07:58
  • @Alex I believe that should be a new question :) – 41686d6564 stands w. Palestine Mar 17 '19 at 08:05
  • I need to know can the above solution be tweaked to make it work for odd numbered length? – user786 Mar 17 '19 at 08:06
  • Never mind, it's easier than I thought. To match odd number of chars _(excluding one-char matches)_, use [`(.)(?<!\1\1)\1(?:\1\1)*\1(?!\1)`](http://regexstorm.net/tester?p=%28.%29%28%3f%3c!%5c1%5c1%29%5c1%28%3f%3a%5c1%5c1%29*%5c1%28%3f!%5c1%29&i=aaabbashasccddee%0d%0aaabbbashascccdddeee). To match odd number of chars _(including one-char matches)_, use [`(.)(?<!\1\1)(?:\1\1)*(?!\1)`](http://regexstorm.net/tester?p=%28.%29%28%3f%3c!%5c1%5c1%29%28%3f%3a%5c1%5c1%29*%28%3f!%5c1%29&i=aaabbashasccddee%0d%0aaabbbashascccdddeee). – 41686d6564 stands w. Palestine Mar 17 '19 at 08:20
  • @AhmedAbdelhameed please check the question here https://stackoverflow.com/questions/55205729/regex-for-odd-length-substrings-in-string-regex-c-sharp – user786 Mar 17 '19 at 09:42
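
The two odd-length patterns from the comments can be tried side by side. This is a minimal sketch: the helper `FindRuns` and the constant names are hypothetical, and the input is the second test line from the regexstorm links in that comment.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Patterns quoted from the comment thread above.
const string OddAtLeastThree = @"(.)(?<!\1\1)\1(?:\1\1)*\1(?!\1)"; // runs of 3, 5, 7, ...
const string OddIncludingOne = @"(.)(?<!\1\1)(?:\1\1)*(?!\1)";     // runs of 1, 3, 5, ...

// Hypothetical helper (an assumption for illustration): returns the text of
// every match of 'pattern' in 'input'.
static string[] FindRuns(string input, string pattern) =>
    Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

string input = "aabbbashascccdddeee";

Console.WriteLine(string.Join(",", FindRuns(input, OddAtLeastThree)));
// bbb,ccc,ddd,eee

Console.WriteLine(string.Join(",", FindRuns(input, OddIncludingOne)));
// bbb,a,s,h,a,s,ccc,ddd,eee
```

In both patterns the leading `aa` is skipped because its length is even; the second pattern additionally reports the isolated single characters.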