7

I've been racking my brain for a couple of days to work out a series or closed-form equation to the following problem:

Specifically: given all strings of length N that draws from an alphabet of L letters (starting with 'A', for example {A, B}, {A, B, C}, ...), how many of those strings contain a substring that matches the pattern: 'A', more than 1 not-'A', 'A'. The standard regular expression for that pattern would be A[^A][^A]+A.

The number of possible strings is simple enough: L^N . For small values of N and L, it's also very practical to simply create all possible combinations and use a regular expression to find the substrings that match the pattern; in R:

all.combinations <- function(N, L) {
    apply(
        expand.grid(rep(list(LETTERS[1:L]), N)),
        1,
        paste,
        collapse = ''
    )
}

matching.pattern <- function(N, L, pattern = 'A[^A][^A]+A') {
    sum(grepl(pattern, all.combinations(N, L)))
}

all.combinations(4, 2)
matching.pattern(4, 2)

I had come up with the following, which works for N < 7:

M <- function(N, L) {
    sum(
        sapply(
            2:(N-2),
            function(g) {
                (N - g - 1) * (L - 1) ** g * L ** (N - g - 2)
            }
        )
    )
}

Unfortunately, that only works while N < 7 because it's simply adding the combinations that have substrings A..A, A...A, A....A, etc. and some combinations obviously have multiple matching substrings (e.g., A..A..A, A..A...A), which are counted twice.

Any suggestions? I am open to procedural solutions too, so long as they don't blow up with the number of combinations (like my code above would). I'd like to be able to compute for values of N from 15 to 25 and L from 2 to 10.

For what it is worth, here's the number of combinations, and matching combinations for some values of N and L that are tractable to determine by generating all combinations and doing a regular expression match:

 N  L  combinations  matching
--  -  ------------  --------
 4  2            16         1
 5  2            32         5
 6  2            64        17
 7  2           128        48
 8  2           256       122
 9  2           512       290
10  2          1024       659
 4  3            81         4
 5  3           243        32
 6  3           729       172
 7  3          2187       760
 8  3          6561      2996
 9  3         19683     10960
10  3         59049     38076
 4  4           256         9
 5  4          1024        99
 6  4          4096       729
 7  4         16384      4410
 8  4         65536     23778
 9  4        262144    118854
10  4       1048576    563499
  • 3
    I would guess you'd get more interest on math.stackexchange – IceCreamToucan Oct 15 '18 at 19:06
  • `AAAAA` shouldn't be counted, since it does not match the pattern (the letter A, followed by two or more letters that are not A, followed by another letter A). For N = 5 and L = 2, the matching patterns are `ABBAA, AABBA, BABBA, ABBBA, ABBAB`. If you are familiar with the regular expression syntax, the pattern would be expressed as `A[^A][^A]+A`. – James McIninch Oct 15 '18 at 19:18

2 Answers2

1

It is possible to use dynamic programming approach.

For fixed L, let X(n) be number of strings of length n that contain given pattern, and let A(n) be number of strings of length n that contain given pattern and starts with A.

First derive recursion formula for A(n). Lets count all strings in A(n) by grouping them by first 2-3 letters. Number of strings in A(n) with:

  • "second letter A" is A(n-1),
  • "second letter non-A and third letter is A" is A(n-2),
  • "second and third letter non-A" is (L^(n-3) - (L-1)^(n-3)). That is because string 'needs' at least one A in remaining letters to be counted.

With that:

A(n) = A(n-1) + (L-1) * (A(n-2) + (L-1) * (L^(n-3) - (L-1)^(n-3)))

String of length n+1 can start with A or non-A:

X(n+1) = A(n+1) + (L-1) * X(n)
X(i) = A(i) = 0, for i <= 3

Python implementation:

def combs(l, n):
    x = [0] * (n + 1)  # First element is not used, easier indexing
    a = [0] * (n + 1)
    for i in range(4, n+1):
        a[i] = a[i-1] + (l-1) * (a[i-2] + (l-1) * (l**(i-3) - (l-1)**(i-3)))
        x[i] = a[i] + (l-1) * x[i-1]
    return x[4:]

print(combs(2, 10))
print(combs(3, 10))
print(combs(4, 10))
Ante
  • 5,350
  • 6
  • 23
  • 46
0

This can be described as a state machine. (For simplicity, x is any letter other than A.)

S0 := 'A' S1 | 'x' S0     // ""
S1 := 'A' S1 | 'x' S2     // A
S2 := 'A' S1 | 'x' S3     // Ax
S3 := 'A' S4 | 'x' S3     // Axx+
S4 := 'A' S4 | 'x' S4 | $ // AxxA

Counting the number of matching strings of length n

S0(n) = S1(n-1) + (L-1)*S0(n-1); S0(0) = 0
S1(n) = S1(n-1) + (L-1)*S2(n-1); S1(0) = 0
S2(n) = S1(n-1) + (L-1)*S3(n-1); S2(0) = 0
S3(n) = S4(n-1) + (L-1)*S3(n-1); S3(0) = 0
S4(n) = S4(n-1) + (L-1)*S4(n-1); S4(0) = 1

Trying to reduce S0(n) to just n and L gives a really long expression, so it would be easiest to calculate the recurrence functions as-is.

For really large n, this could be expressed as a matrix expression, and be efficiently calculated.

                                   n
              [L-1  1   0   0   0 ]
              [ 0   1  L-1  0   0 ]              T
[0 0 0 0 1] × [ 0   1   0  L-1  0 ] × [1 0 0 0 0]
              [ 0   0   0  L-1  1 ]
              [ 0   0   0   0   L ]

In JavaScript:

function f(n, L) {
  var S0 = 0, S1 = 0, S2 = 0, S3 = 0, S4 = 1;
  var S1_tmp;
  while (n-- > 0) {
    S0 = S1 + (L - 1) * S0;
    S1_tmp = S1 + (L - 1) * S2;
    S2 = S1 + (L - 1) * S3;
    S3 = S4 + (L - 1) * S3;
    S4 = S4 + (L - 1) * S4;
    S1 = S1_tmp;
  }
  return S0;
}

var $tbody = $('#resulttable > tbody');
for (var L = 2; L <= 4; L++) {
  for (var n = 4; n <= 10; n++) {
    $('<tr>').append([
      $('<td>').text(n),
      $('<td>').text(L),
      $('<td>').text(f(n,L))
    ]).appendTo($tbody);
  }
}
#resulttable td {
  text-align: right;
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<table id="resulttable">
<thead>
<tr>
<th>N</th>
<th>L</th>
<th>matches</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
Markus Jarderot
  • 86,735
  • 21
  • 136
  • 138