Here is a way to do it using repeat and product from itertools:
from itertools import product, repeat

# This one returns a list, like your version:
def list_all_kmers(k):
    return ["".join(nucls) for nucls in product(*repeat("ACGT", k))]

# This one generates k-mers one by one:
def generate_all_kmers(k):
    # It seems "return" also works
    # I'm not sure it makes a difference here
    # but see https://stackoverflow.com/a/45620965/1878788
    yield from ("".join(nucls) for nucls in product(*repeat("ACGT", k)))

for kmer in generate_all_kmers(3):
    print(kmer)
Result:
AAA
AAC
AAG
AAT
ACA
ACC
ACG
ACT
AGA
AGC
AGG
AGT
ATA
ATC
ATG
ATT
CAA
CAC
CAG
CAT
CCA
CCC
CCG
CCT
CGA
CGC
CGG
CGT
CTA
CTC
CTG
CTT
GAA
GAC
GAG
GAT
GCA
GCC
GCG
GCT
GGA
GGC
GGG
GGT
GTA
GTC
GTG
GTT
TAA
TAC
TAG
TAT
TCA
TCC
TCG
TCT
TGA
TGC
TGG
TGT
TTA
TTC
TTG
TTT
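On the "return" versus "yield from" question raised in the code comment, here is a small sketch of one difference I am sure of: a function that uses return is an ordinary function whose body runs when it is called, while a function that uses yield from is a generator function whose body only starts running on the first next(). The two function names and the check on k are hypothetical, added only to make the timing visible.

```python
from itertools import product, repeat

def kmers_with_return(k):
    # Ordinary function: this check runs as soon as the function is called.
    # (The check itself is a hypothetical addition for illustration.)
    if k < 0:
        raise ValueError("k must be non-negative")
    return ("".join(nucls) for nucls in product(*repeat("ACGT", k)))

def kmers_with_yield_from(k):
    # Generator function: nothing here runs until the first next().
    if k < 0:
        raise ValueError("k must be non-negative")
    yield from ("".join(nucls) for nucls in product(*repeat("ACGT", k)))

# For valid input, both produce the same k-mers in the same order.
assert list(kmers_with_return(2)) == list(kmers_with_yield_from(2))

# For invalid input, only the timing of the error differs.
try:
    kmers_with_return(-1)
except ValueError:
    print("return version: error raised at call time")

gen = kmers_with_yield_from(-1)  # no error yet
try:
    next(gen)
except ValueError:
    print("yield from version: error raised on first next()")
```

For simply looping over the k-mers, as in the code above, either form should behave the same.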
Some explanations:
repeat("ACGT", k) yields the string "ACGT" k times. You can see this by making a list from it:
list(repeat("ACGT", 3))
Result:
['ACGT', 'ACGT', 'ACGT']
product(l1, l2, l3) generates all tuples having the first element from l1, the second from l2 and the third from l3, where l1, l2 and l3 are "iterables", for instance lists or strings. This works with any number of iterables:
Zero:
list(product())
Result:
[()]
One:
list(product("ACGT"))
Result:
[('A',), ('C',), ('G',), ('T',)]
Two:
list(product("ACGT", "ACGT"))
Result:
[('A', 'A'),
('A', 'C'),
('A', 'G'),
('A', 'T'),
('C', 'A'),
('C', 'C'),
('C', 'G'),
('C', 'T'),
('G', 'A'),
('G', 'C'),
('G', 'G'),
('G', 'T'),
('T', 'A'),
('T', 'C'),
('T', 'G'),
('T', 'T')]
Etc.
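The iterables do not have to be identical. As a minimal sketch (the variable names letters and numbers are only for illustration), this mixes a string and a list, and checks that product gives the same tuples, in the same order, as nested for loops:

```python
from itertools import product

letters = "AB"
numbers = [1, 2]

pairs = list(product(letters, numbers))
print(pairs)

# Same tuples, same order, as two nested loops:
# the rightmost iterable varies fastest.
nested = [(letter, number) for letter in letters for number in numbers]
assert pairs == nested

# The number of tuples is the product of the lengths of the inputs.
assert len(list(product("ACGT", "ACGT", "ACGT"))) == 4 ** 3
```

Result:

    [('A', 1), ('A', 2), ('B', 1), ('B', 2)]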
However, if we want to use the result of repeat, we must use the * to say that the generated elements have to be taken as separate arguments. In a function call, f(*[l1, l2, l3]) is like doing f(l1, l2, l3). This also works with any iterable instead of a list, including the iterator that repeat returns, so we don't need to do list(repeat(...)) (we only did that above for visualization purposes).
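To make the effect of the * concrete, here is a short sketch. The helper count_args is hypothetical and only reports how many positional arguments it received. The last part relies on the repeat keyword argument of itertools.product, which should give the same result as unpacking repeat:

```python
from itertools import product, repeat

def count_args(*args):
    # Hypothetical helper: how many positional arguments were passed?
    return len(args)

three = ["ACGT", "ACGT", "ACGT"]

print(count_args(three))                # 1: the list is a single argument
print(count_args(*three))               # 3: each element is its own argument
print(count_args(*repeat("ACGT", 3)))   # 3: works with an iterator too

# product also accepts a "repeat" keyword, which avoids the unpacking:
k = 3
via_star = list(product(*repeat("ACGT", k)))
via_keyword = list(product("ACGT", repeat=k))
assert via_star == via_keyword
```

So product("ACGT", repeat=k) could be used in place of product(*repeat("ACGT", k)) in the functions above, if you prefer that spelling.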
Then we want to make strings out of the elements in the tuples. This is done with the join method of an empty string, which we use in a "list comprehension" (between []) or a "generator expression" (between ()).
The list comprehension creates the full list, while the generator expression produces the elements one by one, "on demand".
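The "on demand" behaviour matters when k is large. As a rough sketch, this takes only the first five 20-mers using islice from itertools. Building the full list would mean 4**20 (about 10**12) strings, while the generator only does the work for the five that are requested:

```python
from itertools import islice, product, repeat

def generate_all_kmers(k):
    yield from ("".join(nucls) for nucls in product(*repeat("ACGT", k)))

# Only five k-mers are ever built here, not all 4**20 of them.
first_five = list(islice(generate_all_kmers(20), 5))
for kmer in first_five:
    print(kmer)
```

Result:

    AAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAC
    AAAAAAAAAAAAAAAAAAAG
    AAAAAAAAAAAAAAAAAAAT
    AAAAAAAAAAAAAAAAAACA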