I've been racking my brain for a couple of days trying to work out a series or closed-form expression for the following problem:
Given all strings of length N drawn from an alphabet of L letters (starting with 'A', for example {A, B}, {A, B, C}, ...), how many of those strings contain a substring that matches the pattern: 'A', then two or more non-'A' letters, then 'A'? The standard regular expression for that pattern is A[^A][^A]+A.
The number of possible strings is simple enough: L^N. For small values of N and L, it's also practical to generate every possible string and use a regular expression to find the ones that contain a match; in R:
all.combinations <- function(N, L) {
  apply(
    expand.grid(rep(list(LETTERS[1:L]), N)),
    1,
    paste,
    collapse = ''
  )
}
matching.pattern <- function(N, L, pattern = 'A[^A][^A]+A') {
  sum(grepl(pattern, all.combinations(N, L)))
}
all.combinations(4, 2)
matching.pattern(4, 2)
I came up with the following formula, which gives the correct count only for N < 7:
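# g is the number of non-'A' letters between the two 'A's; assumes N >= 4
M <- function(N, L) {
  sum(
    sapply(
      2:(N - 2),
      function(g) {
        # (positions for the match) * (choices for the gap) * (free letters)
        (N - g - 1) * (L - 1)^g * L^(N - g - 2)
      }
    )
  )
}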
Unfortunately, that only works while N < 7 because it simply adds up the strings that contain A..A, A...A, A....A, etc. at each possible position. From N = 7 onward, some strings contain several matching substrings (e.g., A..A..A, A..A...A), so they are counted more than once.
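To make the overcounting concrete, here is a small brute-force check for N = 7, L = 2. It is only a sketch: the lookahead regex and the variable names (strings, starts, occurrences) are my own choices for illustration. It counts how many places in each string a match can start, which is the quantity the summation formula adds up.

```r
N <- 7
L <- 2

# Enumerate all L^N strings (only feasible for small N and L).
strings <- apply(expand.grid(rep(list(LETTERS[1:L]), N)), 1, paste, collapse = "")

# A zero-width lookahead finds every position where a match can start,
# including starts that a plain (non-overlapping) search would skip.
starts <- gregexpr("(?=A[^A]{2,}A)", strings, perl = TRUE)

# gregexpr returns -1 when there is no match, so count positive positions only.
occurrences <- sapply(starts, function(m) sum(m > 0))

sum(occurrences)           # total match starts over all strings
sum(occurrences > 0)       # strings with at least one match
strings[occurrences > 1]   # strings that are counted more than once
```

For these values the first sum should come out one higher than the second, and the difference is the single string ABBABBA, which contains two matches.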
Any suggestions? I am open to procedural solutions too, so long as they don't blow up with the number of combinations (like my code above would). I'd like to be able to compute for values of N from 15 to 25 and L from 2 to 10.
For what it is worth, here's the number of combinations, and matching combinations for some values of N and L that are tractable to determine by generating all combinations and doing a regular expression match:
 N  L  combinations  matching
--  -  ------------  --------
 4  2            16         1
 5  2            32         5
 6  2            64        17
 7  2           128        48
 8  2           256       122
 9  2           512       290
10  2          1024       659
 4  3            81         4
 5  3           243        32
 6  3           729       172
 7  3          2187       760
 8  3          6561      2996
 9  3         19683     10960
10  3         59049     38076
 4  4           256         9
 5  4          1024        99
 6  4          4096       729
 7  4         16384      4410
 8  4         65536     23778
 9  4        262144    118854
10  4       1048576    563499
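One procedural approach that avoids enumerating all L^N strings is to count the strings that do not match and subtract from L^N. A string matches exactly when two consecutive 'A's are separated by two or more other letters, so it is enough to track how many non-'A' letters have been seen since the most recent 'A'. The sketch below does that with four running counts. The function name count.matching and the state names are hypothetical, and this is one possible formulation rather than a definitive answer. It uses ordinary R doubles, which are exact only up to 2^53, so for the largest cases (e.g., N = 25, L = 10) the result is approximate unless a big-integer package is used.

```r
# Count strings of length N over L letters that contain A[^A][^A]+A,
# by counting the strings that avoid the pattern and subtracting from L^N.
count.matching <- function(N, L) {
  # States describe a prefix that has not matched yet:
  #   none = no 'A' seen so far
  #   s0   = last letter was 'A'
  #   s1   = exactly one non-'A' since the last 'A'
  #   s2   = two or more non-'A' since the last 'A'
  state <- c(none = 1, s0 = 0, s1 = 0, s2 = 0)
  for (i in seq_len(N)) {
    state <- c(
      none = state[["none"]] * (L - 1),                      # append a non-'A'
      s0   = state[["none"]] + state[["s0"]] + state[["s1"]], # append 'A' safely
      s1   = state[["s0"]] * (L - 1),                        # first non-'A' after 'A'
      s2   = (state[["s1"]] + state[["s2"]]) * (L - 1)       # second or later non-'A'
    )
    # Appending 'A' in state s2 would complete a match, so those strings are dropped.
  }
  L^N - sum(state)
}

count.matching(4, 2)
count.matching(10, 2)
count.matching(7, 3)
```

The loop runs N times with constant work per step, so N from 15 to 25 and L from 2 to 10 are not a problem for running time. The three calls at the end can be compared against the rows for (N, L) = (4, 2), (10, 2) and (7, 3) in the table above.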