Sorry in advance for the long, potentially wandering explanation. :)
The approach that I am taking, because I am specifically working with DNA sequencing, is to map each base to a number, create functions to map between the bases and the values that I assign to them, then create an array with a slot for each of the combinations of k bases (4^k), then use a for loop that essentially acts as a sliding window, incrementing the corresponding slot in the array for each k-length sequence. The algorithm then searches the array for the highest number and uses the index in the array to find the sequence(s) that are the most prevalent.
I am currently getting the following error:
Traceback (most recent call last):
line 74, in <module>
recFreqWords('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4)
line 65, in recFreqWords
freqArray = computeFreqArray(text, k)
line 59, in computeFreqArray
freqArray[index] += 1
IndexError: list index out of range
The original code is below:
def baseToOffset(base):
if base == 'A':
return 0
if base == 'C':
return 1
if base == 'G':
return 2
if base == 'T':
return 3
def offsetToBase(offset):
if offset == 0:
return 'A'
if offset == 1:
return 'C'
if offset == 2:
return 'G'
if offset == 3:
return 'T'
def patternToNumber(pattern):
if len(pattern) == 0:
return 0
symbol = pattern[len(pattern)-2]
print(symbol)
prefix = pattern[:-1]
print(prefix)
offset = baseToOffset(symbol)
print(offset)
return (4 * patternToNumber(prefix) + offset)
def numberToPattern(index, k):
if k == 1:
return offsetToBase(index)
else:
prefixIndex = index // 4
remainder = index % 4
symbol = offsetToBase(remainder)
prefixPattern = numberToPattern(prefixIndex, (k-1))
return prefixPattern + symbol
def computeFreqArray(text, k):
freqArray = [0] * (4**k)
loopLimit = len(text) - k
for i in range(0, loopLimit):
pattern = text[i:(i+k+1)]
index = patternToNumber(pattern)
freqArray[index] += 1
return freqArray
def recFreqWords(text, k):
freqPatterns = []
freqArray = computeFreqArray(text, k)
maxCount = max(freqArray)
loopLimit = (4**k) - 1
for i in range(0, loopLimit):
if freqArray[i] == maxCount:
pattern = numberToPattern(i, k)
freqPatterns.append(pattern)
return freqPatterns