
Similar to this and many other questions, I have many nested loops (up to 16) of the same structure.

Problem: I have a 4-letter alphabet and want to get all possible words of length 16, then filter them. These are DNA sequences (hence the 4 letters: A, T, G, C), and the filtering rules are quite simple:

  • no XXXX substrings (i.e. the same letter can't appear more than 3 times in a row; ATGCATGGGGCTA is "bad")
  • specific GC content: the number of Gs + number of Cs should be in a specific range (40-50%); ATATATATATATA and GCGCGCGCGCGC are bad words (a sketch of such a filter is shown right after this list)
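
To make the rules concrete, a minimal sketch of the word_is_good filter I use below could look like this (lowercase letters, as in the loop code further down; the defaults encode the two rules above):

def word_is_good(word, max_run=3, gc_min=0.4, gc_max=0.5):
    # rule 1: no letter repeated more than max_run times in a row
    for letter in 'atgc':
        if letter * (max_run + 1) in word:
            return False
    # rule 2: GC content within the 40-50% range
    gc = word.count('g') + word.count('c')
    return gc_min * len(word) <= gc <= gc_max * len(word)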

itertools.product will work for that, but the data structure here is going to be giant (4^16 ≈ 4*10^9 words)

More importantly, if I do use product, I still have to go through each element to filter it out. Thus I will have 4 billion steps times 2.
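
For reference, a product-based version would look something like this (each of the ~4 billion candidates still has to be generated and checked one by one):

from itertools import product

candidates = (''.join(p) for p in product('atgc', repeat=16))
good_words = (w for w in candidates if word_is_good(w))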

My current solution is nested for loops:

alphabet = ['a','t','g','c']
for p1 in alphabet:
    for p2 in alphabet:
       for p3 in alphabet:
       ...skip...
          for p16 in alphabet:
             word = p1+p2+p3+...+p16
             if word_is_good(word):
                 good_words.append(word)
                 counter+=1

Is there a good pattern to program this without 16 nested loops? Is there a way to parallelize it efficiently (on multiple cores or multiple EC2 nodes)? Also, with that pattern I can plug the word_is_good check into the middle of the loops: a word that starts badly is already bad:

...skip...
for p3 in alphabet:
    word_3 = p1+p2+p3
    if not word_is_good(word_3):
        continue  # this prefix is already bad, try the next letter
    for p4 in alphabet:
        ...skip...
  • Possible duplicate of [How do I generate permutations of length LEN given a list of N Items?](https://stackoverflow.com/questions/9338052/how-do-i-generate-permutations-of-length-len-given-a-list-of-n-items) – GalAbra May 24 '18 at 22:49
  • `product` is a lazy iterator, so it doesn't build up that massive data structure; it will use `O(16)` storage, not `O(4**16)`. But of course it will still take `O(4**16)` time. But your problem inherently requires that—you want to look at all `4**16` values, except for a small number that can be filtered out. You need to rethink your problem to see if you can find what you want in a smaller search space, not look for a more efficient way to exhaustively search that space. – abarnert May 24 '18 at 22:50
  • @GalAbra I know how to construct permutations; I am looking for suggestions for filtering very large lists – aaaaa says reinstate Monica May 24 '18 at 22:52
  • Meanwhile, you _can_ parallelize it, but the easiest way to do that is to parallelize at one particular level of the tree—e.g., turn the outermost loop into an `executor.map`, or equivalently pull one rep out of `product` to turn into an `executor.map`. But that's only going to give you a constant-factor improvement. Say it makes it 7x faster. So you're now `O(4**16 / 7)`. Is that really acceptable? – abarnert May 24 '18 at 22:52
  • @abarnert thanks, I didn't know `product` is lazy. Another approach I had in mind is probabilistic: generate good words at random, perhaps missing a few of them. I am using brute force right now, I acknowledge – aaaaa says reinstate Monica May 24 '18 at 23:00
  • You have to tell us exactly what you want to filter out. That will significantly affect the answer. There isn't a general pattern for any filter function. – Alex Hall May 24 '18 at 23:20
  • If the filtering really is just a couple of rules that don't filter out all that much (like your "no letter repeated 4 times"), so there are, say, 3.2 billion good words, and your goal is to generate all of those good words (which is what you asked for in your question), iterating through all 4 billion possibilities is going to be a whole lot better than doing it probabilistically, and optimizing the way you filter out 20% of them isn't likely to speed things up enough to be worth adding much extra complexity. – abarnert May 24 '18 at 23:24
  • What are you trying to do? Can you just generate them? – user202729 May 25 '18 at 06:01
  • @AlexHall added a note on the real problem; it's genetic DNA sequences we want to go through – aaaaa says reinstate Monica May 25 '18 at 15:21

2 Answers

1

Since you happen to have an alphabet of length 4 (or any power-of-2 length), the idea of using an integer ID and bit-wise operations comes to mind instead of checking for consecutive characters in strings. We can assign an integer value to each of the characters in the alphabet; for simplicity, let's use the index corresponding to each letter.

Example:

65463543 (base 10) = 3321232103313 (base 4) = 'aaaddcbcdcbaddbd'

The following function converts a base-10 integer ID to a word using ALPHABET.

def id_to_word(word_id, word_len):
    word = ''
    while word_id:
        rem = word_id & 0x3  # 2 bits per letter
        word = ALPHABET[rem] + word
        word_id >>= 2  # Bit shift to the next letter
    # Pad on the left with ALPHABET[0] up to word_len
    return '{2:{0}>{1}}'.format(ALPHABET[0], word_len, word)
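
For instance, with ALPHABET = ('a', 'b', 'c', 'd') as defined further down, the ID from the example above converts back to the same 16-letter word:

>>> id_to_word(65463543, 16)
'aaaddcbcdcbaddbd'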

Now for a function to check whether a word is "good" based on its integer ID. The following method is of a similar format to id_to_word, except a counter is used to keep track of consecutive characters. The function will return False if the maximum number of identical consecutive characters is exceeded, otherwise it returns True.

def check_word(word_id, max_consecutive):
    consecutive = 0
    previous = None
    while word_id:
        rem = word_id & 0x3
        if rem != previous:
            consecutive = 0
        consecutive += 1
        if consecutive == max_consecutive + 1:
            return False
        word_id >>= 2
        previous = rem
    return True
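
As a quick sanity check (with a = 0 ... d = 3 as above), an ID containing four identical letters in a row fails, while three in a row passes:

>>> check_word(int('3333', 4), 3)  # 'dddd'
False
>>> check_word(int('3330', 4), 3)  # 'ddda'
True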

We're effectively treating each word as an integer in base 4. If the alphabet length were not a power of 2, then modulo (% alpha_len) and integer division (// alpha_len) could be used in place of the bit mask (& (alpha_len - 1)) and bit shift (>> log2(alpha_len)) respectively, although it would be slower.
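
A minimal sketch of that generalisation (a hypothetical helper, not used elsewhere in this answer) could be:

def id_to_word_any_base(word_id, word_len, alphabet):
    # Same idea as id_to_word, but works for any alphabet length
    # by using % and // instead of bit masks and shifts.
    alpha_len = len(alphabet)
    word = ''
    while word_id:
        word = alphabet[word_id % alpha_len] + word
        word_id //= alpha_len
    # Pad on the left with the first letter up to word_len
    return word.rjust(word_len, alphabet[0])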

Finally, finding all the good words for a given word_len. The advantage of using a range of integer IDs is that you can reduce the number of for-loops in your code from word_len to 2, albeit the second loop covers a very large range. This may also lend itself to more friendly multiprocessing of your good-word-finding task. I have also added a quick calculation to determine the smallest and largest IDs corresponding to good words, which trims a little off both ends of the search range.

ALPHABET = ('a', 'b', 'c', 'd')

def find_good_words(word_len):
    max_consecutive = 3
    alpha_len = len(ALPHABET)

    # Determine the words corresponding to the smallest and largest ids
    smallest_word = ''  # aaabaaabaaabaaab
    largest_word = ''   # dddcdddcdddcdddc
    # Build the smallest/largest good words left to right:
    # every (max_consecutive + 1)-th position is forced to break the run
    for i in range(word_len):
        if (i + 1) % (max_consecutive + 1):
            smallest_word += ALPHABET[0]
            largest_word += ALPHABET[-1]
        else:
            smallest_word += ALPHABET[1]
            largest_word += ALPHABET[-2]

    # Determine the integer ids of said words
    trans_table = str.maketrans({c: str(i) for i, c in enumerate(ALPHABET)})

    smallest_id = int(smallest_word.translate(trans_table), alpha_len)  # 16843009
    largest_id = int(largest_word.translate(trans_table), alpha_len)    # 4278124286

    # Find and store the IDs of "good" words
    counter = 0
    goodies = []

    for i in range(smallest_id, largest_id + 1):
        if check_word(i, max_consecutive):
            goodies.append(i)
            counter += 1
    return goodies

In this loop I have specifically stored the word's ID rather than the actual word itself, in case you are going to use the IDs for further processing. However, if you are just after the words, change the goodies.append(i) line to goodies.append(id_to_word(i, word_len)).

NOTE: I receive a MemoryError when attempting to store all good IDs for word_len >= 14. I suggest writing these IDs/words to a file of some sort!
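
One way to do that (a sketch, not part of the code above; first_id and last_id would be the smallest_id and largest_id computed in find_good_words, and the output path is arbitrary) is to stream the words to disk instead of collecting them in a list:

def iter_good_ids(first_id, last_id, max_consecutive=3):
    # Lazily yield good IDs one at a time instead of storing them all
    for i in range(first_id, last_id + 1):
        if check_word(i, max_consecutive):
            yield i

def dump_good_words(first_id, last_id, word_len, path='good_words.txt'):
    # Stream the decoded words to a file so memory use stays flat
    with open(path, 'w') as f:
        for i in iter_good_ids(first_id, last_id):
            f.write(id_to_word(i, word_len) + '\n')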

1
from itertools import product, islice
from time import time

length = 16

def generate(start, alphabet):
    """
    A recursive generator function which works like itertools.product
    but restricts the alphabet as it goes based on the letters accumulated so far.
    """

    if len(start) == length:
        yield start
        return

    gcs = start.count('g') + start.count('c')
    if gcs >= length * 0.5:
        alphabet = 'at'

    # consider the maximum number of Gs and Cs we can have in the end
    # if we add one more A/T now
    elif length - len(start) - 1 + gcs < length * 0.4:
        alphabet = 'gc'

    for c in alphabet:
        if start.endswith(c * 3):
            continue

        for string in generate(start + c, alphabet):
            yield string

def brute_force():
    """ Straightforward method for comparison """
    lower = length * 0.4
    upper = length * 0.5
    for s in product('atgc', repeat=length):
        if lower <= s.count('g') + s.count('c') <= upper:
            s = ''.join(s)
            if not ('aaaa' in s or
                    'tttt' in s or
                    'cccc' in s or
                    'gggg' in s):
                yield s

def main():
    funcs = [
        lambda: generate('', 'atgc'),
        brute_force
    ]

    # Testing performance
    for func in funcs:

        # This needs to be big to get an accurate measure,
        # otherwise `brute_force` seems slower than it really is.
        # This is probably because of how `itertools.product`
        # is implemented.
        count = 100000000
        start = time()
        for _ in islice(func(), count):
            pass
        print(time() - start)

    # Testing correctness
    global length
    length = 12
    for x, y in zip(*[func() for func in funcs]):
        assert x == y, (x, y)

main()

On my machine, generate was just a bit faster than brute_force, at about 390 seconds vs 425. This was pretty much as fast as I could make them. I think the full thing would take about 2 hours. Of course, actually processing them will take much longer. The problem is that your constraints don't reduce the full set much.

Here's an example of how to use this in parallel across 16 processes:

from multiprocessing.pool import Pool

alpha = 'atgc'

def generate_worker(start):
    start = ''.join(start)
    for s in generate(start, alpha):
        print(s)

# One worker per 2-letter prefix (4**2 = 16 prefixes).
# On platforms that spawn rather than fork processes (e.g. Windows),
# guard this call with if __name__ == '__main__':
Pool(16).map(generate_worker, product(alpha, repeat=2))
Alex Hall