1

How can I simplify the nested for loops in this function by using the k argument?

def PatternGenerate(k):
    base = ['A','C','G','T']
    pattern = []
    for x in base:
        for y in base:
            for z in base:
                result = str(x) + str(y) + str(z)
                pattern.append(result)
    return pattern

I get the result I want, but the function does not actually use k:

['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', 'AGA', 'AGC', 'AGG', 'AGT', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CAT', 'CCA', 'CCC', 'CCG', 'CCT', 'CGA', 'CGC', 'CGG', 'CGT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAC', 'GAG', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'TAA', 'TAC', 'TAG', 'TAT', 'TCA', 'TCC', 'TCG', 'TCT', 'TGA', 'TGC', 'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT']

David Faber
Lam Thinh
  • Possible duplicate of [How to get all possible combinations of a list’s elements?](https://stackoverflow.com/questions/464864/how-to-get-all-possible-combinations-of-a-list-s-elements) like `for combo in itertools.combinations('AAACCCGGGTTT',3): print combo` – tk421 Feb 01 '19 at 21:14
  • @tk421, output of your command is different: for example, it produces `'GTT'` six times. – Andriy Makukha Feb 01 '19 at 21:32
  • @AndriyMakukha, yeah, you need to filter it to a set or similar, to remove duplicates. – tk421 Feb 01 '19 at 22:17

4 Answers

1

One way of doing it is with recursion. Here is an example of a generator function that does this:

def genAll(depth, base = ['A','C','G','T']):
    if depth <= 0:
        yield ''
    else:
        for char in base:
            for tail in genAll(depth - 1, base):
                yield char + tail

for comb in genAll(2):
    print(comb)

Output:

AA
AC
AG
AT
CA
CC
CG
CT
GA
GC
GG
GT
TA
TC
TG
TT
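
If a list is needed, as returned by the function in the question, the generator can be wrapped in `list()`. This is a minimal sketch that repeats `genAll` from above so it runs on its own; `pattern_generate` is a hypothetical wrapper name, not part of the original code.

```python
def genAll(depth, base=['A', 'C', 'G', 'T']):
    # Base case: yield one empty string so every char gets exactly one tail.
    if depth <= 0:
        yield ''
    else:
        for char in base:
            for tail in genAll(depth - 1, base):
                yield char + tail

def pattern_generate(k):
    # Hypothetical wrapper: materialize the generator into a list.
    return list(genAll(k))

patterns = pattern_generate(3)
print(len(patterns))   # 64, i.e. 4**3
print(patterns[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```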
Andriy Makukha
1

It will be much easier if you use a recursive form.

def PatternGenerate(k):
    base = ['A','C','G','T']
    pattern = []
    if k == 1:
        return base
    else:
        for p in PatternGenerate(k-1):
            for b in base:
                pattern.append(p+b)

        return pattern

Explanation: if k == 1, simply return base. If k > 1, compute PatternGenerate(k-1) and append each character of base to each of its patterns.
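
The same recursion can also be written with k == 0 as the base case, which additionally avoids endless recursion when k is 0. This is only a sketch of that variant, not the code above; `pattern_generate_rec` and its `base` parameter are hypothetical names.

```python
def pattern_generate_rec(k, base="ACGT"):
    # Base case: the only pattern of length 0 is the empty string.
    if k <= 0:
        return [""]
    # Recursive case: extend every pattern of length k-1 by each base character.
    return [p + b for p in pattern_generate_rec(k - 1, base) for b in base]

print(pattern_generate_rec(1))       # ['A', 'C', 'G', 'T']
print(len(pattern_generate_rec(3)))  # 64
```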

digitake
  • Very similar approach to mine, but it will generate entire array at once, which might be suboptimal. Generator functions (with `yield`) are generally preferred and can lead to higher performance. – Andriy Makukha Feb 01 '19 at 21:16
0

Lazier version using itertools

import itertools
k = 2
result = ["".join(t) for t in itertools.combinations_with_replacement(['A','C','G','T'], k)]
print(result)

The implementation inside combinations_with_replacement is very similar to that of @Andriy.

digitake
  • Output is different from expected. `combinations_with_replacement` apparently treats results as unordered sets (even though it returns tuples of elements). – Andriy Makukha Feb 01 '19 at 21:35
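
The difference noted in the comment above can be checked directly. The sketch below compares `combinations_with_replacement` with `itertools.product` for k = 2; the variable names are made up for illustration.

```python
from itertools import combinations_with_replacement, product

bases = "ACGT"
k = 2

# Order-insensitive: 'AC' is produced, but 'CA' is not.
unordered = ["".join(t) for t in combinations_with_replacement(bases, k)]
# Order-sensitive: every position ranges over all four bases.
ordered = ["".join(t) for t in product(bases, repeat=k)]

print(len(unordered))  # 10
print(len(ordered))    # 16
print(sorted(set(ordered) - set(unordered)))
# ['CA', 'GA', 'GC', 'TA', 'TC', 'TG']
```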
0

Here is a way to do it using repeat and product from itertools:

from itertools import product, repeat

# This one returns a list, like your version:
def list_all_kmers(k):
    return ["".join(nucls) for nucls in product(*repeat("ACGT", k))]

# This one generates k-mers one by one:
def generate_all_kmers(k):
    # It seems "return" also works
    # I'm not sure it makes a difference here
    # but see https://stackoverflow.com/a/45620965/1878788
    yield from ("".join(nucls) for nucls in product(*repeat("ACGT", k)))

for kmer in generate_all_kmers(3):
    print(kmer)

Result:

AAA
AAC
AAG
AAT
ACA
ACC
ACG
ACT
AGA
AGC
AGG
AGT
ATA
ATC
ATG
ATT
CAA
CAC
CAG
CAT
CCA
CCC
CCG
CCT
CGA
CGC
CGG
CGT
CTA
CTC
CTG
CTT
GAA
GAC
GAG
GAT
GCA
GCC
GCG
GCT
GGA
GGC
GGG
GGT
GTA
GTC
GTG
GTT
TAA
TAC
TAG
TAT
TCA
TCC
TCG
TCT
TGA
TGC
TGG
TGT
TTA
TTC
TTG
TTT

Some explanations:

repeat("ACGT", k) yields "ACGT" k times. This can be visualized by making a list from it:

list(repeat("ACGT", 3))

Result:

['ACGT', 'ACGT', 'ACGT']

product(l1, l2, l3) generates all tuples whose first element comes from l1, the second from l2 and the third from l3, where l1, l2 and l3 are "iterables", for instance lists or strings. This works with any number of iterables:

Zero:

list(product())

Result:

[()]

One:

list(product("ACGT"))

Result:

[('A',), ('C',), ('G',), ('T',)]

Two:

list(product("ACGT", "ACGT"))

Result:

[('A', 'A'),
 ('A', 'C'),
 ('A', 'G'),
 ('A', 'T'),
 ('C', 'A'),
 ('C', 'C'),
 ('C', 'G'),
 ('C', 'T'),
 ('G', 'A'),
 ('G', 'C'),
 ('G', 'G'),
 ('G', 'T'),
 ('T', 'A'),
 ('T', 'C'),
 ('T', 'G'),
 ('T', 'T')]

Etc.

However, to pass the result of repeat to product, we must use * so that the generated elements are taken as separate arguments. In a function call, f(*[l1, l2, l3]) is equivalent to f(l1, l2, l3). This also works with any iterable, not just a list, so we don't need list(repeat(...)) (it was used above only for visualization).
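
A small sketch of the unpacking behaviour described above. `show` is a hypothetical helper used only to make the received arguments visible. The last lines also compare the result with the `repeat` keyword argument of `product`, which gives the same tuples.

```python
from itertools import product, repeat

def show(*args):
    # Hypothetical helper: return the positional arguments it received.
    return args

seqs = ["ACGT", "ACGT", "ACGT"]
# Unpacking a list is the same as writing the arguments out by hand.
assert show(*seqs) == show("ACGT", "ACGT", "ACGT")
# Unpacking works for any iterable, including the iterator from repeat().
assert show(*repeat("ACGT", 3)) == show(*seqs)

k = 3
via_star = list(product(*repeat("ACGT", k)))
via_keyword = list(product("ACGT", repeat=k))
print(via_star == via_keyword)  # True
```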

Then we want to make strings out of the elements in the tuples. This is done with the join method of an empty string, which we use in a "list comprehension" (between []) or a "generator expression" (between ()).

The list comprehension creates the full list at once, while the generator expression produces the elements one by one, "on demand".
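
To illustrate the "on demand" behaviour, the sketch below takes only the first three 10-mers from a generator expression using `itertools.islice`, without building all 4**10 strings. The variable names are made up for illustration.

```python
from itertools import islice, product, repeat

k = 10
# Generator expression: no k-mer is built until it is requested.
kmers = ("".join(nucls) for nucls in product(*repeat("ACGT", k)))
first_three = list(islice(kmers, 3))
print(first_three)  # ['AAAAAAAAAA', 'AAAAAAAAAC', 'AAAAAAAAAG']

# List comprehension: the whole list exists as soon as this line has run.
all_3mers = ["".join(nucls) for nucls in product(*repeat("ACGT", 3))]
print(len(all_3mers))  # 64
```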

bli