I have a list of sequences:
CAGGTAGCCA
CCGGTCAGAT
AGGGTTTGAT
TTGGTGAGGC
CAAGTATGAG
ACTGTATGCA
CTGGTAACCA
Each sequence is 10 nucleotides long. I want to calculate the dinucleotide frequency of AA, CC, CG ..etc for each of the dinucleotide positions.
For example, 1st two column will have these bases CA,CC,AG,TT etc, and second two should be AG,CG,GG,TG and so on. The dummy output should be something like this:
AA 0.25 0.34 0.56 0.43 0.00 0.90 0.45 0.34 0.31 0.01
CC 0.45 0.40 0.90 0.00 0.40 0.90 0.30 0.25 0.2 0.10
GC 0.00 0.00 0.34 1.00 0.30 0.30 0.35 0.90 0.1 0.30
TC 0.24 0.56 0.00 0.00 1.00 0.34 0.45 0.35 0.36 0.45
AA, CC, GC and TC could be any dinucleotide combination.
My try:
import itertools
hl = []
#making all possible dinucleotide combinations
bases=['A','T','G','C']
k = 2
kmers = [''.join(p) for p in itertools.product(bases, repeat=k)]
kmers = dict.fromkeys(kmers, 0)
print(kmers)
for i in range(9):
hl.append(kmers)
# my seq file
f = open("H3K27ac.seq")
nLines = 0
for line in f:
id = 0
# trying to read dinucleotide at each loop
for (fst, sec) in zip(line[0::2], line[1::2]):
id = id + 1
hl[id][fst+sec] += 1
nLines += 1
f.close()
nLines = float(nLines)
for char in ['AA', 'TT', 'CC', 'GG', 'CG', 'GC']:
print("{}\t{}".format(char, "\t".join(["{:0.2f}".format(x[char]/nLines) for x in hl])))
By somehow it's estimation the total count of all dinucleotide in whole file. I am trying to make it more general so that it can be used for tri- and any further number of nucleotides.