Generate a regex to find sub strings with certain number of occurrences of certain characters

Question

Suppose I have a string "IICCIICCIICBIICCIICDII". The string is of the pattern II[CBD][CBD]II[CBD][CBD]II... It is a repeated pattern. Now I am trying to find all overlapping substrings which satisfies following condition:

Substring does not start or end with letter I. Suggested solution: (?<=[CBD]), (?=[CBD])
Substring contains at least(but as less as possible) certain number of occurrences of letter C, B, and D. These letters can exist in any permutation and can have different number of occurrences. Suggested solution: something along the lines of [C]{m, n}?
These pre-defined numbers are variable so that I can dynamically generate regex with change in requirement for these variables.
Order and number of occurrences of letter I doesn't matter
Adding to condition 1: Start/end of a substring can't have two occurrences of [CBD] (e.g., CCIIB or BIICC is an invalid match) (sorry about that)

For example, for a pattern with at least 2 Cs: CIIC (three of them), 2 Cs and 1 B: CIICBIIC, BIICCIIC

I do not think my question is similar to the question quoted in one of the answers. I have looked at that question (titled "shortest repeated substring"). My question differs in the sense that repeated pattern needs to have a certain number of occurrences of certain characters. The question quoted is just finding shortest repeating pattern. That question IS helpful though.

Let me know if the question is clear and not a duplicate. Thanks.

@aliteralmind: While the linked questions are similar, this Q is for Python, not for C# or Java. For Python, [this Q](http://stackoverflow.com/questions/8633996/shortest-repeating-sub-string) might help. It seems the user is familiar with regexes, but not with Python. Is this correct? — cfi, Apr 02 '14 at 21:23
This `at least 2 Cs: CIIC (three of them), 2 Cs and 1 B: CIICBIIC, BIICCIIC` is not too clear. — , Apr 02 '14 at 21:45
Ok, so I added an answer to this very dificult question. But, **Duplicate**, really? The duplicate reference link [here](http://stackoverflow.com/questions/320448/overlapping-matches-in-regex) (from above) accepts an answer which is this `(?<=n)n`. Does that really answer this question? Any standalone lookahead is overlapping. Might as well mark all regex questions as duplicate because thats the level this question is to that one. Too bad, because this question has a lot of territory for thought that will never be. — , Apr 03 '14 at 14:56
Thanks @sln for recognizing that this is not a duplicate question. — user3491040, Apr 04 '14 at 11:44

score 1 · Accepted Answer · 2014-05-02T17:15:42.330

Final analysis

Adressing your latest comments.
As suspected, this can't be done with a regular expression unless
it can do counting. Specifically, countin with the ability to reset counters
when backtracking.

There is only one engine that will do that, and it is Perl, and unfortunately,
using Python for this task is out of the question.

I'm adding Perl regex below to do this. Only adding it to visualize the
methodology should you want to accomplish the same task without using regex.
Which certainly can be done.

Sorry, couldn't be more of a help to you. - sln

 # (?{ $vb=0; $vc=0; $vd=0; })(?=(?![BCD]{2})(?![I])((?:(?:[B][I]*?)(?{ local $vb = $vb+1 })|(?:[C][I]*?)(?{ local $vc = $vc+1 })|(?:[D][I]*?)(?{ local $vd = $vd+1 }))+?)(?(?{$vb >= 2 && $vc >= 5 && $vd >= 2})(?{ $VB=$vb; $VC=$vc; $VD=$vd; })|(?!))(?<![I])(?<![BCD]{2}))
 #

 (?{ $vb=0; $vc=0; $vd=0; })         # Initialize local counters to zero
 (?=
      (?! [BCD]{2} )                      # App Condition 5a, not start with 2 occurances of BCD
      (?! [I] )                           # App Condition 1a, not start with I
      (                                   # (1 start)
           (?:                                 # Cluster group start (App Conditions 2-4)
                (?: [B] [I]*? )                     # 'B'
                (?{ local $vb = $vb+1 })            # Increment local 'B' counter
             |  
                (?: [C] [I]*? )                     # 'C'
                (?{ local $vc = $vc+1 })            # Increment local 'C' counter
             |  
                (?: [D] [I]*? )                     # 'D'
                (?{ local $vd = $vd+1 })            # Increment local 'D' counter
           )+?                                 # Cluster group end, do the minimum
                                               # to satisfy conditions
      )                                   # (1 end)

      (?(?{
           # Code conditional - the local counters
           # must be greater than or equal to these values
           $vb >= 2 && $vc >= 5 && $vd >= 2
        })
           # Yes condition, copy local counters to global vars.
           (?{ $VB=$vb; $VC=$vc; $VD=$vd; })
        |  
           # No condition, fail the expression here
           # force engine to backtrack (and reset local counters) 
           (?!)
      )
      (?<! [I] )                          # App Condition 1b, not end with I
      (?<! [BCD]{2} )                     # App Condition 5b, not end with 2 occurances of BCD
 )

Perl test case

 $str = "IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII";
 print  "\n";
 print  "01234567890123456789012345678901234567890123456789012345678901\n";
 print  "          1         2         3         4         5         6\n";
 print  $str,"\n-------------------------------------------------------\n";

 FindOverlaps(2,5,2);
 FindOverlaps(1,2,0);
 FindOverlaps(1,1,0);
 FindOverlaps(1,1,1);
 FindOverlaps(0,1,1);
 FindOverlaps(1,0,1);

 sub FindOverlaps
 {
     ($MinB, $MinC, $MinD) = @_;

     print "\nB=$MinB, C=$MinC, D=$MinD\n";

     while ( $str =~ /

          (?{ $vb=0; $vc=0; $vd=0; })         # Initialize local counters to zero
          (?=
               (?! [BCD]{2} )                      # App Condition 5a, not start with 2 occurances of BCD
               (?! [I] )                           # App Condition 1a, not start with I
               (                                   # (1 start)
                    (?:                                 # Cluster group start (App Conditions 2-4)
                         (?: [B] [I]*? )                     # 'B'
                         (?{ local $vb = $vb+1 })            # Increment local 'B' counter
                      |  
                         (?: [C] [I]*? )                     # 'C'
                         (?{ local $vc = $vc+1 })            # Increment local 'C' counter
                      |  
                         (?: [D] [I]*? )                     # 'D'
                         (?{ local $vd = $vd+1 })            # Increment local 'D' counter
                    )+?                                 # Cluster group end, do the minimum
                                                        # to satisfy conditions
               )                                   # (1 end)

               (?(?{
                    # Code conditional - the local counters
                    # must be greater than or equal to these values
                    $vb >= $MinB && $vc >= $MinC && $vd >= $MinD
                 })
                    # Yes condition, copy local counters to global vars.
                    (?{ $VB=$vb; $VC=$vc; $VD=$vd; })
                 |  
                    # No condition, fail the expression here
                    # force engine to backtrack (and reset local counters) 
                    (?!)
               )
               (?<! [I] )                          # App Condition 1b, not end with I
               (?<! [BCD]{2} )                     # App Condition 5b, not end with 2 occurances of BCD
          )
     /xg )
     {
        print sprintf("found:   %-10s %-30s  offset = %s\n", "\($VB,$VC,$VD\)", $1, @-[0]);
     }
 }

Output >>

 01234567890123456789012345678901234567890123456789012345678901
           1         2         3         4         5         6
 IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII
 -------------------------------------------------------

 B=2, C=5, D=2
 found:   (2,10,2)   CIICBIICCIIDCIICCIICDIICCIIB    offset = 3
 found:   (2,8,2)    BIICCIIDCIICCIICDIICCIIB        offset = 7
 found:   (2,12,2)   CIIDCIICCIICDIICCIIBCIICCIICBIIC  offset = 11
 found:   (2,12,2)   CIICCIICDIICCIIBCIICCIICBIICCIID  offset = 15
 found:   (2,10,2)   CIICDIICCIIBCIICCIICBIICCIID    offset = 19
 found:   (2,8,2)    DIICCIIBCIICCIICBIICCIID        offset = 23

 B=1, C=2, D=0
 found:   (1,3,0)    CIICBIIC                        offset = 3
 found:   (1,2,1)    BIICCIID                        offset = 7
 found:   (1,7,2)    CIIDCIICCIICDIICCIIB            offset = 11
 found:   (1,6,1)    CIICCIICDIICCIIB                offset = 15
 found:   (1,4,1)    CIICDIICCIIB                    offset = 19
 found:   (1,2,1)    DIICCIIB                        offset = 23
 found:   (1,3,0)    CIIBCIIC                        offset = 27
 found:   (1,5,0)    CIICCIICBIIC                    offset = 31
 found:   (1,3,0)    CIICBIIC                        offset = 35
 found:   (1,2,1)    BIICCIID                        offset = 39

 B=1, C=1, D=0
 found:   (1,3,0)    CIICBIIC                        offset = 3
 found:   (1,1,0)    BIIC                            offset = 7
 found:   (1,7,2)    CIIDCIICCIICDIICCIIB            offset = 11
 found:   (1,6,1)    CIICCIICDIICCIIB                offset = 15
 found:   (1,4,1)    CIICDIICCIIB                    offset = 19
 found:   (1,2,1)    DIICCIIB                        offset = 23
 found:   (1,1,0)    CIIB                            offset = 27
 found:   (1,5,0)    CIICCIICBIIC                    offset = 31
 found:   (1,3,0)    CIICBIIC                        offset = 35
 found:   (1,1,0)    BIIC                            offset = 39

 B=1, C=1, D=1
 found:   (1,4,1)    CIICBIICCIID                    offset = 3
 found:   (1,2,1)    BIICCIID                        offset = 7
 found:   (1,7,2)    CIIDCIICCIICDIICCIIB            offset = 11
 found:   (1,6,1)    CIICCIICDIICCIIB                offset = 15
 found:   (1,4,1)    CIICDIICCIIB                    offset = 19
 found:   (1,2,1)    DIICCIIB                        offset = 23
 found:   (2,7,1)    CIIBCIICCIICBIICCIID            offset = 27
 found:   (1,6,1)    CIICCIICBIICCIID                offset = 31
 found:   (1,4,1)    CIICBIICCIID                    offset = 35
 found:   (1,2,1)    BIICCIID                        offset = 39

 B=0, C=1, D=1
 found:   (1,4,1)    CIICBIICCIID                    offset = 3
 found:   (1,2,1)    BIICCIID                        offset = 7
 found:   (0,1,1)    CIID                            offset = 11
 found:   (0,5,1)    CIICCIICDIIC                    offset = 15
 found:   (0,3,1)    CIICDIIC                        offset = 19
 found:   (0,1,1)    DIIC                            offset = 23
 found:   (2,7,1)    CIIBCIICCIICBIICCIID            offset = 27
 found:   (1,6,1)    CIICCIICBIICCIID                offset = 31
 found:   (1,4,1)    CIICBIICCIID                    offset = 35
 found:   (1,2,1)    BIICCIID                        offset = 39
 found:   (0,1,1)    CIID                            offset = 43

 B=1, C=0, D=1
 found:   (1,4,1)    CIICBIICCIID                    offset = 3
 found:   (1,2,1)    BIICCIID                        offset = 7
 found:   (1,7,2)    CIIDCIICCIICDIICCIIB            offset = 11
 found:   (1,6,1)    CIICCIICDIICCIIB                offset = 15
 found:   (1,4,1)    CIICDIICCIIB                    offset = 19
 found:   (1,2,1)    DIICCIIB                        offset = 23
 found:   (2,7,1)    CIIBCIICCIICBIICCIID            offset = 27
 found:   (1,6,1)    CIICCIICBIICCIID                offset = 31
 found:   (1,4,1)    CIICBIICCIID                    offset = 35
 found:   (1,2,1)    BIICCIID                        offset = 39

(old)

I think this is the best you could do with a regular expression

Edit - Modified for new condition 5.

 #  String:
 #  (?=(?![BCD]{2})(?![I])((?:[B][IDC]*?){1}(?:[C][IDB]*?){2}(?:[D][IBC]*?){0}|(?:[C][IDB]*?){2}(?:[D][IBC]*?){0}(?:[B][IDC]*?){1}|(?:[D][IBC]*?){0}(?:[B][IDC]*?){1}(?:[C][IDB]*?){2}|(?:[C][IDB]*?){2}(?:[B][IDC]*?){1}(?:[D][IBC]*?){0})(?<![I])(?<![BCD]{2}))

 # Example: Finds 1-B, 2-C's     
 (?=
      (?! [BCD]{2} )              # Condition 5a, not start with 2 occurances of BCD
      (?! [I] )                   # Condition 1a, not start with I (not really necessary here)

      (                           # (1 start), Conditions 2-4
           (?: [B] [IDC]*? ){1}
           (?: [C] [IDB]*? ){2}
           (?: [D] [IBC]*? ){0}
        |  
           (?: [C] [IDB]*? ){2}
           (?: [D] [IBC]*? ){0}
           (?: [B] [IDC]*? ){1}
        |  
           (?: [D] [IBC]*? ){0}
           (?: [B] [IDC]*? ){1}
           (?: [C] [IDB]*? ){2}
        |  
           (?: [C] [IDB]*? ){2}
           (?: [B] [IDC]*? ){1}
           (?: [D] [IBC]*? ){0}
      )                           # (1 end)

      (?<! [I] )                  # Condition 1b, not end with I
      (?<! [BCD]{2} )             # Condition 5b, not end with 2 occurances of BCD
 )

Perl test case

  $str = "IICCIICCIICBIICCIICDIIDIICCIIB";

  print  "\n";
  print  "012345678911234567892123456789\n";
  print  "          +         +         \n";
  print  $str,"\n------------------------------\n";

  ($B,$C,$D) = (1,2,0);
  FindOverlaps();

  ($B,$C,$D) = (1,1,0);
  FindOverlaps();

  ($B,$C,$D) = (1,1,1);
  FindOverlaps();

  ($B,$C,$D) = (0,1,1);
  FindOverlaps();

  ($B,$C,$D) = (1,0,1);
  FindOverlaps();

  sub FindOverlaps
  {
      print "\nB=$B, C=$C, D=$D\n";

      while ( $str =~ /(?=(?![BCD]{2})(?![I])((?:[B][IDC]*?){$B}(?:[C][IDB]*?){$C}(?:[D][IBC]*?){$D}|(?:[C][IDB]*?){$C}(?:[D][IBC]*?){$D}(?:[B][IDC]*?){$B}|(?:[D][IBC]*?){$D}(?:[B][IDC]*?){$B}(?:[C][IDB]*?){$C}|(?:[C][IDB]*?){$C}(?:[B][IDC]*?){$B}(?:[D][IBC]*?){$D})(?<![I])(?<![BCD]{2}))/g )
      {
          print "found:  '$1' \t offset = @-[0]\n";
      }
  }

Output >>

 012345678911234567892123456789
           +         +
 IICCIICCIICBIICCIICDIIDIICCIIB
 ------------------------------

 B=1, C=2, D=0
 found:  'CIICBIIC'       offset = 7
 found:  'BIICCIIC'       offset = 11

 B=1, C=1, D=0
 found:  'BIIC'   offset = 11
 found:  'CIIB'   offset = 26

 B=1, C=1, D=1
 found:  'BIICCIICDIID'   offset = 11

 B=0, C=1, D=1
 found:  'DIIC'   offset = 22

 B=1, C=0, D=1
 found:  'BIICCIICDIID'   offset = 11
 found:  'DIICCIIB'       offset = 22

Thanks for your reply. I am looking for something along the lines what you provided. Although, I should have been clearer. There is another condition (refer to condition 5 in the edited question, sorry about that). This condition will not allow matches such as BC, CB, BIICC (anything that has two consecutive [CBD] at the end or start of a match. — user3491040, Apr 04 '14 at 11:43
@user3491040 - Your welcome! Added condition 5, a simple fix. — , Apr 04 '14 at 21:38
@user3491040: consider up-voting and accepting this answer, by clicking on the big green checkmark. @ sln: I've voted to reopen. — aliteralmind, Apr 07 '14 at 04:08
I think I cant vote because I am new to stack overflow but I did accept your answer. — user3491040, Apr 07 '14 at 23:42
There might be something missing in this regex. When I tried to find a substring with at least 5 Cs, 2 Bs, and 2 Ds in IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII, it couldn't where you can easily see that there exist at least one non overlapping and more than one overlapping pattern: BIICCIIDCIICCIICDIICCIIB, CIIDCIICCIICDIICCIIBCIICCIICBIIC Thanks for your time and help. — user3491040, Apr 22 '14 at 00:30
Upvote/downvote like a yoyo isin't a way to get my attention. I'll think about your comment. — , May 01 '14 at 19:57
@user3491040 - I've added an update, and some code. A Perl regex solution is do'able, probably not in Python though. — , May 02 '14 at 17:19

Generate a regex to find sub strings with certain number of occurrences of certain characters

1 Answers1