Final analysis
Addressing your latest comments.
As suspected, this can't be done with a regular expression unless
the engine can count. Specifically, it needs counters that are reset
when the engine backtracks.
As far as I know, Perl is the only engine that will do that, so unfortunately,
using Python for this task is out of the question.
I'm adding a Perl regex below that does this. I'm only adding it to illustrate the
methodology, should you want to accomplish the same task without using a regex,
which certainly can be done.
Sorry I couldn't be of more help to you. - sln
# (?{ $vb=0; $vc=0; $vd=0; })(?=(?![BCD]{2})(?![I])((?:(?:[B][I]*?)(?{ local $vb = $vb+1 })|(?:[C][I]*?)(?{ local $vc = $vc+1 })|(?:[D][I]*?)(?{ local $vd = $vd+1 }))+?)(?(?{$vb >= 2 && $vc >= 5 && $vd >= 2})(?{ $VB=$vb; $VC=$vc; $VD=$vd; })|(?!))(?<![I])(?<![BCD]{2}))
#
(?{ $vb=0; $vc=0; $vd=0; })  # Initialize local counters to zero
(?=
    (?! [BCD]{2} )  # App Condition 5a, not start with 2 occurrences of BCD
    (?! [I] )  # App Condition 1a, not start with I
    (  # (1 start)
        (?:  # Cluster group start (App Conditions 2-4)
            (?: [B] [I]*? )  # 'B'
            (?{ local $vb = $vb+1 })  # Increment local 'B' counter
          |
            (?: [C] [I]*? )  # 'C'
            (?{ local $vc = $vc+1 })  # Increment local 'C' counter
          |
            (?: [D] [I]*? )  # 'D'
            (?{ local $vd = $vd+1 })  # Increment local 'D' counter
        )+?  # Cluster group end, do the minimum
             # to satisfy conditions
    )  # (1 end)
    (?(?{
            # Code conditional - the local counters
            # must be greater than or equal to these values
            $vb >= 2 && $vc >= 5 && $vd >= 2
        })
        # Yes condition, copy local counters to global vars.
        (?{ $VB=$vb; $VC=$vc; $VD=$vd; })
      |
        # No condition, fail the expression here
        # force engine to backtrack (and reset local counters)
        (?!)
    )
    (?<! [I] )  # App Condition 1b, not end with I
    (?<! [BCD]{2} )  # App Condition 5b, not end with 2 occurrences of BCD
)
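The counters in the pattern above depend on how Perl treats `local` inside `(?{ ... })`: an assignment made with `local` is undone when the engine backtracks past it. Here is a minimal sketch of that behaviour on its own, adapted from the example in the perlre documentation. The names `$cnt`, `$res` and `$text` are illustrative and are not part of the pattern above.

```perl
use strict;
use warnings;

# Package variables, because `local` cannot be applied to `my` variables.
our ($cnt, $res);

my $text = 'a' x 8;

$text =~ /
    (?{ $cnt = 0 })                         # initialise the counter
    (?: a (?{ local $cnt = $cnt + 1 }) )*   # count each 'a' the star consumes
    aaaa                                    # forces the star to give back four 'a's
    (?{ $res = $cnt })                      # copy the surviving count out
/x;

print "count after backtracking: $res\n";
```

The star first consumes all eight characters, then gives four back so that the literal `aaaa` can match. Each step back also unwinds one `local` increment, so the count copied into `$res` should be 4 rather than 8.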
Perl test case
$str = "IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII";
print "\n";
print "01234567890123456789012345678901234567890123456789012345678901\n";
print "          1         2         3         4         5         6\n";
print $str,"\n-------------------------------------------------------\n";
FindOverlaps(2,5,2);
FindOverlaps(1,2,0);
FindOverlaps(1,1,0);
FindOverlaps(1,1,1);
FindOverlaps(0,1,1);
FindOverlaps(1,0,1);
sub FindOverlaps
{
    ($MinB, $MinC, $MinD) = @_;
    print "\nB=$MinB, C=$MinC, D=$MinD\n";
    while ( $str =~ /
        (?{ $vb=0; $vc=0; $vd=0; })  # Initialize local counters to zero
        (?=
            (?! [BCD]{2} )  # App Condition 5a, not start with 2 occurrences of BCD
            (?! [I] )  # App Condition 1a, not start with I
            (  # (1 start)
                (?:  # Cluster group start (App Conditions 2-4)
                    (?: [B] [I]*? )  # 'B'
                    (?{ local $vb = $vb+1 })  # Increment local 'B' counter
                  |
                    (?: [C] [I]*? )  # 'C'
                    (?{ local $vc = $vc+1 })  # Increment local 'C' counter
                  |
                    (?: [D] [I]*? )  # 'D'
                    (?{ local $vd = $vd+1 })  # Increment local 'D' counter
                )+?  # Cluster group end, do the minimum
                     # to satisfy conditions
            )  # (1 end)
            (?(?{
                    # Code conditional - the local counters
                    # must be greater than or equal to these values
                    $vb >= $MinB && $vc >= $MinC && $vd >= $MinD
                })
                # Yes condition, copy local counters to global vars.
                (?{ $VB=$vb; $VC=$vc; $VD=$vd; })
              |
                # No condition, fail the expression here
                # force engine to backtrack (and reset local counters)
                (?!)
            )
            (?<! [I] )  # App Condition 1b, not end with I
            (?<! [BCD]{2} )  # App Condition 5b, not end with 2 occurrences of BCD
        )
        /xg )
    {
        # $-[0] is the offset at which the current match starts
        printf("found: %-10s %-30s offset = %s\n", "($VB,$VC,$VD)", $1, $-[0]);
    }
}
Output >>
01234567890123456789012345678901234567890123456789012345678901
          1         2         3         4         5         6
IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII
-------------------------------------------------------
B=2, C=5, D=2
found: (2,10,2) CIICBIICCIIDCIICCIICDIICCIIB offset = 3
found: (2,8,2) BIICCIIDCIICCIICDIICCIIB offset = 7
found: (2,12,2) CIIDCIICCIICDIICCIIBCIICCIICBIIC offset = 11
found: (2,12,2) CIICCIICDIICCIIBCIICCIICBIICCIID offset = 15
found: (2,10,2) CIICDIICCIIBCIICCIICBIICCIID offset = 19
found: (2,8,2) DIICCIIBCIICCIICBIICCIID offset = 23
B=1, C=2, D=0
found: (1,3,0) CIICBIIC offset = 3
found: (1,2,1) BIICCIID offset = 7
found: (1,7,2) CIIDCIICCIICDIICCIIB offset = 11
found: (1,6,1) CIICCIICDIICCIIB offset = 15
found: (1,4,1) CIICDIICCIIB offset = 19
found: (1,2,1) DIICCIIB offset = 23
found: (1,3,0) CIIBCIIC offset = 27
found: (1,5,0) CIICCIICBIIC offset = 31
found: (1,3,0) CIICBIIC offset = 35
found: (1,2,1) BIICCIID offset = 39
B=1, C=1, D=0
found: (1,3,0) CIICBIIC offset = 3
found: (1,1,0) BIIC offset = 7
found: (1,7,2) CIIDCIICCIICDIICCIIB offset = 11
found: (1,6,1) CIICCIICDIICCIIB offset = 15
found: (1,4,1) CIICDIICCIIB offset = 19
found: (1,2,1) DIICCIIB offset = 23
found: (1,1,0) CIIB offset = 27
found: (1,5,0) CIICCIICBIIC offset = 31
found: (1,3,0) CIICBIIC offset = 35
found: (1,1,0) BIIC offset = 39
B=1, C=1, D=1
found: (1,4,1) CIICBIICCIID offset = 3
found: (1,2,1) BIICCIID offset = 7
found: (1,7,2) CIIDCIICCIICDIICCIIB offset = 11
found: (1,6,1) CIICCIICDIICCIIB offset = 15
found: (1,4,1) CIICDIICCIIB offset = 19
found: (1,2,1) DIICCIIB offset = 23
found: (2,7,1) CIIBCIICCIICBIICCIID offset = 27
found: (1,6,1) CIICCIICBIICCIID offset = 31
found: (1,4,1) CIICBIICCIID offset = 35
found: (1,2,1) BIICCIID offset = 39
B=0, C=1, D=1
found: (1,4,1) CIICBIICCIID offset = 3
found: (1,2,1) BIICCIID offset = 7
found: (0,1,1) CIID offset = 11
found: (0,5,1) CIICCIICDIIC offset = 15
found: (0,3,1) CIICDIIC offset = 19
found: (0,1,1) DIIC offset = 23
found: (2,7,1) CIIBCIICCIICBIICCIID offset = 27
found: (1,6,1) CIICCIICBIICCIID offset = 31
found: (1,4,1) CIICBIICCIID offset = 35
found: (1,2,1) BIICCIID offset = 39
found: (0,1,1) CIID offset = 43
B=1, C=0, D=1
found: (1,4,1) CIICBIICCIID offset = 3
found: (1,2,1) BIICCIID offset = 7
found: (1,7,2) CIIDCIICCIICDIICCIIB offset = 11
found: (1,6,1) CIICCIICDIICCIIB offset = 15
found: (1,4,1) CIICDIICCIIB offset = 19
found: (1,2,1) DIICCIIB offset = 23
found: (2,7,1) CIIBCIICCIICBIICCIID offset = 27
found: (1,6,1) CIICCIICBIICCIID offset = 31
found: (1,4,1) CIICBIICCIID offset = 35
found: (1,2,1) BIICCIID offset = 39
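For anyone who wants the same result without a regex, the methodology can be sketched as a plain scan: try every start position, count B, C and D while skipping I, and stop at the first letter where the minimums and the end conditions hold. `find_overlaps_scan` is a hypothetical name. The sketch assumes the input is made up of B, C, D and I as in the test string, and treats any other character as the end of a run.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original answer.
# Returns one [offset, substring, B, C, D] entry per matching start position.
sub find_overlaps_scan {
    my ($str, $min_b, $min_c, $min_d) = @_;
    my %is_letter = map { $_ => 1 } qw(B C D);
    my @chars = split //, $str;
    my @found;

    for my $i (0 .. $#chars) {
        next unless $is_letter{ $chars[$i] };                  # 1a: start on B, C or D
        next if $i < $#chars && $is_letter{ $chars[$i + 1] };  # 5a: not two letters at the start

        my %n = (B => 0, C => 0, D => 0);                      # fresh counters for each start
        for my $j ($i .. $#chars) {
            my $ch = $chars[$j];
            next if $ch eq 'I';                                # fillers are skipped, never counted
            last unless $is_letter{$ch};                       # any other character ends the run
            $n{$ch}++;
            next unless $n{B} >= $min_b && $n{C} >= $min_c && $n{D} >= $min_d;
            next if $j > 0 && $is_letter{ $chars[$j - 1] };    # 5b: not two letters at the end
            push @found, [ $i, join('', @chars[$i .. $j]), @n{qw(B C D)} ];
            last;                                              # shortest span wins, like the lazy +?
        }
    }
    return @found;
}

my $str = "IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII";
for my $hit (find_overlaps_scan($str, 1, 2, 0)) {
    my ($offset, $span, $nb, $nc, $nd) = @$hit;
    printf "found: %-10s %-30s offset = %s\n", "($nb,$nc,$nd)", $span, $offset;
}
```

Because the counters are rebuilt for every start position, nothing has to be unwound on failure, which is the job `local` does inside the regex. For the test string, the call shown should list the same ten spans as the B=1, C=2, D=0 block above.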
(old)
I think this is the best you could do with a regular expression.
Edit - Modified for new condition 5.
# String:
# (?=(?![BCD]{2})(?![I])((?:[B][IDC]*?){1}(?:[C][IDB]*?){2}(?:[D][IBC]*?){0}|(?:[C][IDB]*?){2}(?:[D][IBC]*?){0}(?:[B][IDC]*?){1}|(?:[D][IBC]*?){0}(?:[B][IDC]*?){1}(?:[C][IDB]*?){2}|(?:[C][IDB]*?){2}(?:[B][IDC]*?){1}(?:[D][IBC]*?){0})(?<![I])(?<![BCD]{2}))
# Example: Finds 1-B, 2-C's
(?=
    (?! [BCD]{2} )  # Condition 5a, not start with 2 occurrences of BCD
    (?! [I] )  # Condition 1a, not start with I (not really necessary here)
    (  # (1 start), Conditions 2-4
        (?: [B] [IDC]*? ){1}
        (?: [C] [IDB]*? ){2}
        (?: [D] [IBC]*? ){0}
      |
        (?: [C] [IDB]*? ){2}
        (?: [D] [IBC]*? ){0}
        (?: [B] [IDC]*? ){1}
      |
        (?: [D] [IBC]*? ){0}
        (?: [B] [IDC]*? ){1}
        (?: [C] [IDB]*? ){2}
      |
        (?: [C] [IDB]*? ){2}
        (?: [B] [IDC]*? ){1}
        (?: [D] [IBC]*? ){0}
    )  # (1 end)
    (?<! [I] )  # Condition 1b, not end with I
    (?<! [BCD]{2} )  # Condition 5b, not end with 2 occurrences of BCD
)
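Writing each ordering out by hand gets long quickly. As a sketch, the branches could be assembled from a list of orderings instead. `build_pattern` and its arguments are hypothetical names, and the four orderings passed in the usage lines are the ones that appear in the pattern above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the original answer.
# $count is a hash ref such as { B => 1, C => 2, D => 0 };
# @orders lists the letter orderings to try, e.g. 'BCD'.
sub build_pattern {
    my ($count, @orders) = @_;
    my %others = (B => 'IDC', C => 'IDB', D => 'IBC');   # what may follow each letter

    my @branches;
    for my $order (@orders) {
        my $branch = '';
        for my $letter (split //, $order) {
            $branch .= sprintf '(?:[%s][%s]*?){%d}',
                $letter, $others{$letter}, $count->{$letter};
        }
        push @branches, $branch;
    }
    my $alternation = join '|', @branches;

    # Same lookahead wrapper and start/end conditions as the pattern above.
    return qr/(?=(?![BCD]{2})(?![I])($alternation)(?<![I])(?<![BCD]{2}))/;
}

my $str = "IICCIICCIICBIICCIICDIIDIICCIIB";
my $re  = build_pattern({ B => 1, C => 2, D => 0 }, qw(BCD CDB DBC CBD));

while ($str =~ /$re/g) {
    print "found: '$1' \t offset = ", $-[0], "\n";
}
```

With those four orderings the generated pattern is the same string as the hand-written one, so it should find the same spans. Passing other orderings changes which spans qualify.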
Perl test case
$str = "IICCIICCIICBIICCIICDIIDIICCIIB";
print "\n";
print "012345678911234567892123456789\n";
print " + + \n";
print $str,"\n------------------------------\n";
($B,$C,$D) = (1,2,0);
FindOverlaps();
($B,$C,$D) = (1,1,0);
FindOverlaps();
($B,$C,$D) = (1,1,1);
FindOverlaps();
($B,$C,$D) = (0,1,1);
FindOverlaps();
($B,$C,$D) = (1,0,1);
FindOverlaps();
sub FindOverlaps
{
    print "\nB=$B, C=$C, D=$D\n";
    while ( $str =~ /(?=(?![BCD]{2})(?![I])((?:[B][IDC]*?){$B}(?:[C][IDB]*?){$C}(?:[D][IBC]*?){$D}|(?:[C][IDB]*?){$C}(?:[D][IBC]*?){$D}(?:[B][IDC]*?){$B}|(?:[D][IBC]*?){$D}(?:[B][IDC]*?){$B}(?:[C][IDB]*?){$C}|(?:[C][IDB]*?){$C}(?:[B][IDC]*?){$B}(?:[D][IBC]*?){$D})(?<![I])(?<![BCD]{2}))/g )
    {
        # $-[0] is the offset at which the current match starts
        print "found: '$1' \t offset = ", $-[0], "\n";
    }
}
Output >>
012345678911234567892123456789
+ +
IICCIICCIICBIICCIICDIIDIICCIIB
------------------------------
B=1, C=2, D=0
found: 'CIICBIIC' offset = 7
found: 'BIICCIIC' offset = 11
B=1, C=1, D=0
found: 'BIIC' offset = 11
found: 'CIIB' offset = 26
B=1, C=1, D=1
found: 'BIICCIICDIID' offset = 11
B=0, C=1, D=1
found: 'DIIC' offset = 22
B=1, C=0, D=1
found: 'BIICCIICDIID' offset = 11
found: 'DIICCIIB' offset = 22