calculating kmer nucleotide frequency per column

Question

I have a list of sequences:

CAGGTAGCCA
CCGGTCAGAT
AGGGTTTGAT
TTGGTGAGGC
CAAGTATGAG
ACTGTATGCA
CTGGTAACCA

Each sequence is 10 nucleotides long. I want to calculate the dinucleotide frequency of AA, CC, CG ..etc for each of the dinucleotide positions.

For example, 1st two column will have these bases CA,CC,AG,TT etc, and second two should be AG,CG,GG,TG and so on. The dummy output should be something like this:

AA   0.25    0.34    0.56    0.43    0.00    0.90    0.45    0.34    0.31    0.01
CC   0.45    0.40    0.90    0.00    0.40    0.90    0.30    0.25    0.2     0.10
GC   0.00    0.00    0.34    1.00    0.30    0.30    0.35    0.90    0.1     0.30
TC   0.24    0.56    0.00    0.00    1.00    0.34    0.45    0.35    0.36    0.45

AA, CC, GC and TC could be any dinucleotide combination.

My try:

import itertools
hl = []

#making all possible dinucleotide combinations
bases=['A','T','G','C']
k = 2

kmers = [''.join(p) for p in itertools.product(bases, repeat=k)]
kmers = dict.fromkeys(kmers, 0)
print(kmers)
for i in range(9):
    hl.append(kmers)

# my seq file
f = open("H3K27ac.seq")
nLines = 0

for line in f:
    id = 0
# trying to read dinucleotide at each loop
    for (fst, sec) in zip(line[0::2], line[1::2]):
        id = id + 1
        hl[id][fst+sec] += 1
    nLines += 1
f.close()

nLines = float(nLines)
for char in ['AA', 'TT', 'CC', 'GG', 'CG', 'GC']:   
    print("{}\t{}".format(char, "\t".join(["{:0.2f}".format(x[char]/nLines) for x in hl])))

By somehow it's estimation the total count of all dinucleotide in whole file. I am trying to make it more general so that it can be used for tri- and any further number of nucleotides.

Paul M. · Accepted Answer · 2021-08-04T23:08:15.417

Can you elaborate a bit more on what you mean by "frequency for each of the dinucleotide positions"? The following code doesn't compute any kind of percentage or frequency, but it may be helpful for iterating over the columns:

seqs = """CAGGTAGCCA
CCGGTCAGAT
AGGGTTTGAT
TTGGTGAGGC
CAAGTATGAG
ACTGTATGCA
CTGGTAACCA""".splitlines()

def chunkwise(it, n=2):
    from itertools import tee
    its = tee(it, n)
    for index, it in enumerate(its):
        for _ in range(index):
            _ = next(it, None)
    yield from zip(*its)

for col in zip(*map(chunkwise, seqs)):
    print(col)

Output:

(('C', 'A'), ('C', 'C'), ('A', 'G'), ('T', 'T'), ('C', 'A'), ('A', 'C'), ('C', 'T'))
(('A', 'G'), ('C', 'G'), ('G', 'G'), ('T', 'G'), ('A', 'A'), ('C', 'T'), ('T', 'G'))
(('G', 'G'), ('G', 'G'), ('G', 'G'), ('G', 'G'), ('A', 'G'), ('T', 'G'), ('G', 'G'))
(('G', 'T'), ('G', 'T'), ('G', 'T'), ('G', 'T'), ('G', 'T'), ('G', 'T'), ('G', 'T'))
(('T', 'A'), ('T', 'C'), ('T', 'T'), ('T', 'G'), ('T', 'A'), ('T', 'A'), ('T', 'A'))
(('A', 'G'), ('C', 'A'), ('T', 'T'), ('G', 'A'), ('A', 'T'), ('A', 'T'), ('A', 'A'))
(('G', 'C'), ('A', 'G'), ('T', 'G'), ('A', 'G'), ('T', 'G'), ('T', 'G'), ('A', 'C'))
(('C', 'C'), ('G', 'A'), ('G', 'A'), ('G', 'G'), ('G', 'A'), ('G', 'C'), ('C', 'C'))
(('C', 'A'), ('A', 'T'), ('A', 'T'), ('G', 'C'), ('A', 'G'), ('C', 'A'), ('C', 'A'))
>>>

EDIT - I would use collections.Counter to count the "kmers" in the columns, because you don't need to construct a dictionary with all zeros first. If you try to access a non-existent key-value pair in a counter, it will return zero by default:

from collections import Counter
from itertools import product

seqs = """CAGGTAGCCA
CCGGTCAGAT
AGGGTTTGAT
TTGGTGAGGC
CAAGTATGAG
ACTGTATGCA
CTGGTAACCA""".splitlines()

def chunkwise(it, n=2):
    from itertools import tee
    its = tee(it, n)
    for index, it in enumerate(its):
        for _ in range(index):
            _ = next(it, None)
    yield from zip(*its)

k = 2

kmers = ["".join(p) for p in product("ATGC", repeat=k)]

counters = []
for col in zip(*[chunkwise(seq, n=k) for seq in seqs]):
    counters.append(Counter("".join(chunk) for chunk in col))

for kmer in kmers:
    print("{}   {}".format(kmer, "    ".join("{:.2f}".format(c[kmer]/sum(c.values())) for c in counters)))

Output:

AA   0.00    0.14    0.00    0.00    0.00    0.14    0.00    0.00    0.00
AT   0.00    0.00    0.00    0.00    0.00    0.29    0.00    0.00    0.29
AG   0.14    0.14    0.14    0.00    0.00    0.14    0.29    0.00    0.14
AC   0.14    0.00    0.00    0.00    0.00    0.00    0.14    0.00    0.00
TA   0.00    0.00    0.00    0.00    0.57    0.00    0.00    0.00    0.00
TT   0.14    0.00    0.00    0.00    0.14    0.14    0.00    0.00    0.00
TG   0.00    0.29    0.14    0.00    0.14    0.00    0.43    0.00    0.00
TC   0.00    0.00    0.00    0.00    0.14    0.00    0.00    0.00    0.00
GA   0.00    0.00    0.00    0.00    0.00    0.14    0.00    0.43    0.00
GT   0.00    0.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00
GG   0.00    0.14    0.71    0.00    0.00    0.00    0.00    0.14    0.00
GC   0.00    0.00    0.00    0.00    0.00    0.00    0.14    0.14    0.14
CA   0.29    0.00    0.00    0.00    0.00    0.14    0.00    0.00    0.43
CT   0.14    0.14    0.00    0.00    0.00    0.00    0.00    0.00    0.00
CG   0.00    0.14    0.00    0.00    0.00    0.00    0.00    0.00    0.00
CC   0.14    0.00    0.00    0.00    0.00    0.00    0.00    0.29    0.00
>>>

if you see the first two columns for one dinucleotide, you will find CA,CC,AG,TT, CA, AT and CT. So here CC frequency will be 1 out of 7 or 1/7 = 0.14. Here no dinucleotide repeated twice so all the frequency will be 0.14, but frequency of AA, GG, TC and all other will be zero. you generated the matrix correctly, so does I, but I am unable to get the frequencies. — kashiff007, Aug 04 '21 at 20:51

Alex Reynolds · Answer 2 · 2021-08-04T23:37:52.403

Look at lines like this:

for (fst, sec) in zip(line[0::2], line[1::2]):

This locks you into 2-mers:

>>> l = "CAGGTAGCCA"
>>> for (fst, sec) in zip(l[0::2], l[1::2]): print('{}{}'.format(fst, sec))
... 
CA
GG
TA
GC
CA

If you want to generalize for kmer frequencies of any particular value of k, look for lines of that sort and what start/stop indices you are using for iterating over the sequence, to retrieve keys used for counting purposes.

Perhaps instead use a slice approach to get kmers of longer lengths, using the correct start offset and length values e.g.:

>>> k = 3
>>> for x in range(len(l)-k+1): print(l[x:x+k])
... 
CAG
AGG
GGT
GTA
TAG
AGC
GCC
CCA

Or:

>>> k = 4
>>> for x in range(len(l)-k+1): print(l[x:x+k])
... 
CAGG
AGGT
GGTA
GTAG
TAGC
AGCC
GCCA

It can be useful to profile or time how long it takes to generate the kmers for use as keys. One approach is to use a list and append to it, another is to construct a list comprehension.

One way to profile quickly to use timeit, by first specifying two functions for constructing the kmer list:

>>> def f1(l, k):
...     a=[]
...     for x in range(len(l)-k+1): a.append(l[x:x+k])
...     return a

>>> def f2(l, k):
...     return [l[x:x+k] for x in range(len(l)-k+1)]

And then running the timeit function from the timeit class. Here are some adhoc results from my system:

>>> import timeit
>>> timeit.timeit('f1("CAGGTAGCCA", 3)', globals=globals())
1.3822405610000033
>>> timeit.timeit('f2("CAGGTAGCCA", 3)', globals=globals())
1.51058234300001

I was surprised that the list comprehension was slower than appending to a list, but that's why profiling can be useful.

calculating kmer nucleotide frequency per column

2 Answers2