Edit: I've updated the solution to use cached short sequences for lengths greater than 4. This significantly speeds up the calculations. With the simple version, it'd take 18.5 hours to generate all sequences of length 7, but with the new method only 4.5 hours.
I'll let the docstring do all of the talking for describing the solution.
"""
Problem:
Generate a string of N characters that only contains alphanumerical
characters. The following restrictions apply:
* 0-9 must come before a-z, which must come before A-Z
* it's valid to not have any digits or letters in a sequence
* no neighbouring characters can be the same
* the sequences must be in an order as if the string is base62, e.g.,
01010...01019, 0101a...0101z, 0101A...0101Z, 01020...etc
Solution:
Implement a recursive approach which discards invalid trees. For example,
for "---" start with "0--" and recurse. Try "00-", but discard it for
"01-". The first and last sequences would then be "010" and "ZYZ".
If the previous character in the sequence is a lowercase letter, such as
in "02f-", shrink the pool of available characters to a-zA-Z. Similarly,
for "9gB-", we should only be working with A-Z.
The input also allows to define a specific sequence to start from. For
example, for "abGH", each character will have access to a limited set of
its pool. In this case, the last letter can iterate from H to Z, at which
point it'll be free to iterate its whole character pool next time around.
When specifying a starting sequence, if it doesn't have enough characters
compared to `length`, it will be padded to the right with characters free
to explore their character pool. For example, for length 4, the starting
sequence "29" will be transformed to "29 ", where we will deal with two
restricted characters temporarily.
For long lengths the function internally calls a routine which relies on
fewer recursions and cached results. Length 4 has been chosen as optimal
in terms of precomputing time and memory demands. Briefly, the sequence is
broken into a remainder and chunks of 4. For each preceeding valid
subsequence, all valid following subsequences are fetched. For example, a
sequence of six would be split into "--|----" and for "fB|----" all
subsequences of 4 starting A, C, D, etc would be produced.
Examples:
>>> for i, x in enumerate(generate_sequences(7)):
... print i, x
0, 0101010
1, 0101012
etc
>>> for i, x in enumerate(generate_sequences(7, '012abcAB')):
... print i, x
0, 012abcAB
1, 012abcAC
etc
>>> for i, x in enumerate(generate_sequences(7, 'aB')):
... print i, x
0, aBABABA
1, aBABABC
etc
"""
import string
ALLOWED_CHARS = (string.digits + string.ascii_letters,
string.ascii_letters,
string.ascii_uppercase,
)
CACHE_LEN = 4
def _generate_sequences(length, sequence, previous=''):
char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
if sequence[-length] != ' ':
char_set = char_set[char_set.find(sequence[-length]):]
sequence[-length] = ' '
char_set = char_set.replace(previous, '')
if length == 1:
for char in char_set:
yield char
else:
for char in char_set:
for seq in _generate_sequences(length-1, sequence, char):
yield char + seq
def _generate_sequences_cache(length, sequence, cache, previous=''):
sublength = length if length == CACHE_LEN else min(CACHE_LEN, length-CACHE_LEN)
subseq = cache[sublength != CACHE_LEN]
char_set = ALLOWED_CHARS[previous.isalpha() * (2 - previous.islower())]
if sequence[-length] != ' ':
char_set = char_set[char_set.find(sequence[-length]):]
index = len(sequence) - length
subseq0 = ''.join(sequence[index:index+sublength]).strip()
sequence[index:index+sublength] = [' '] * sublength
if len(subseq0) > 1:
subseq[char_set[0]] = tuple(
s for s in subseq[char_set[0]] if s.startswith(subseq0))
char_set = char_set.replace(previous, '')
if length == CACHE_LEN:
for char in char_set:
for seq in subseq[char]:
yield seq
else:
for char in char_set:
for seq1 in subseq[char]:
for seq2 in _generate_sequences_cache(
length-sublength, sequence, cache, seq1[-1]):
yield seq1 + seq2
def precompute(length):
char_set = ALLOWED_CHARS[0]
if length > 1:
sequence = [' '] * length
result = {}
for char in char_set:
result[char] = tuple(char + seq for seq in _generate_sequences(
length-1, sequence, char))
else:
result = {char: tuple(char) for char in ALLOWED_CHARS[0]}
return result
def generate_sequences(length, sequence=''):
# -------------------------------------------------------------------------
# Error checking: consistency of the value/type of the arguments
if not isinstance(length, int):
msg = 'The sequence length must be an integer: {}'
raise TypeError(msg.format(type(length)))
if length < 0:
msg = 'The sequence length must be greater or equal than 0: {}'
raise ValueError(msg.format(length))
if not isinstance(sequence, str):
msg = 'The sequence must be a string: {}'
raise TypeError(msg.format(type(sequence)))
if len(sequence) > length:
msg = 'The sequence has length greater than {}'
raise ValueError(msg.format(length))
# -------------------------------------------------------------------------
if not length:
yield ''
else:
# ---------------------------------------------------------------------
# Error checking: the starting sequence, if provided, must be valid
if any(s not in ALLOWED_CHARS[0]+' ' for s in sequence):
msg = 'The sequence contains invalid characters: {}'
raise ValueError(msg.format(sequence))
if sequence.strip() != sequence.replace(' ', ''):
msg = 'Uninitiated characters in the middle of the sequence: {}'
raise ValueError(msg.format(sequence.strip()))
sequence = sequence.strip()
if any(a == b for a, b in zip(sequence[:-1], sequence[1:])):
msg = 'No neighbours must be the same character: {}'
raise ValueError(msg.format(sequence))
char_type = [s.isalpha() * (2 - s.islower()) for s in sequence]
if char_type != sorted(char_type):
msg = '0-9 must come before a-z, which must come before A-Z: {}'
raise ValueError(msg.format(sequence))
# ---------------------------------------------------------------------
sequence = list(sequence.ljust(length))
if length <= CACHE_LEN:
for s in _generate_sequences(length, sequence):
yield s
else:
remainder = length % CACHE_LEN
if not remainder:
cache = tuple((precompute(CACHE_LEN),))
else:
cache = tuple((precompute(CACHE_LEN), precompute(remainder)))
for s in _generate_sequences_cache(length, sequence, cache):
yield s
I've included thorough error checks in the generate_sequences()
function. For the sake of brevity you can remove them if you can guarantee that whoever calls the function will never do so with invalid input. Specifically, invalid starting sequences.
Counting number of sequences of specific length
While the function will sequentially generate the sequences, there is a simple combinatorics calcuation we can perform to compute how many valid sequences exist in total.
The sequences can effectively be broken down to 3 separate subsequences. Generally speaking, a sequence can contain anything from 0 to 7 digits, followed by from 0 to 7 lowercase letters, followed by from 0 to 7 uppercase letters. As long as the sum of those is 7. This means we can have the partition (1, 3, 3), or (2, 1, 3), or (6, 0, 1), etc. We can use the stars and bars to calculate the various combinations of splitting a sum of N into k bins. There is already an implementation for python, which we'll borrow. The first few partitions are:
[0, 0, 7]
[0, 1, 6]
[0, 2, 5]
[0, 3, 4]
[0, 4, 3]
[0, 5, 2]
[0, 6, 1]
...
Next, we need to calculate how many valid sequences we have within a partition. Since the digit subsequences are independent of the lowercase letters, which are independent of the uppercase letters, we can calculate them individually and multiply them together.
So, how many digit combinations we can have for a length of 4? The first character can be any of the 10 digits, but the second character has only 9 options (ten minus the one that the previous character is). Similarly for the third letter and so on. So the total number of valid subsequences is 10*9*9*9. Similarly, for length 3 for letters, we get 26*25*25. Overall, for the partition, say, (2, 3, 2), we have 10*9*26*25*25*26*25 = 950625000 combinations.
import itertools as it
def partitions(n, k):
for c in it.combinations(xrange(n+k-1), k-1):
yield [b-a-1 for a, b in zip((-1,)+c, c+(n+k-1,))]
def count_subsequences(pool, length):
if length < 2:
return pool**length
return pool * (pool-1)**(length-1)
def count_sequences(length):
counts = [[count_subsequences(i, j) for j in xrange(length+1)] \
for i in [10, 26]]
print 'Partition {:>18}'.format('Sequence count')
total = 0
for a, b, c in partitions(length, 3):
subtotal = counts[0][a] * counts[1][b] * counts[1][c]
total += subtotal
print '{} {:18}'.format((a, b, c), subtotal)
print '\nTOTAL {:22}'.format(total)
Overall, we observe that while generating the sequences fast isn't a problem, there are so many that it can take a long time. Length 7 has 78550354750 (78.5 billion) valid sequences and this number only scales approximately by a factor of 25 with each incremented length.