1

Sorry in advance for the long, potentially wandering explanation. :)

The approach that I am taking, because I am specifically working with DNA sequencing, is to map each base to a number, create functions to map between the bases and the values that I assign to them, then create an array with a slot for each of the combinations of k bases (4^k), then use a for loop that essentially acts as a sliding window, incrementing the corresponding slot in the array for each k-length sequence. The algorithm then searches the array for the highest number and uses the index in the array to find the sequence(s) that are the most prevalent.

I am currently getting the following error:

Traceback (most recent call last):
  line 74, in <module>
    recFreqWords('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4)
  line 65, in recFreqWords
    freqArray = computeFreqArray(text, k)
  line 59, in computeFreqArray
    freqArray[index] += 1
IndexError: list index out of range

The original code is below:

def baseToOffset(base):
    if base == 'A':
        return 0
    if base == 'C':
        return 1
    if base == 'G':
        return 2
    if base == 'T':
        return 3

def offsetToBase(offset):
    if offset == 0:
        return 'A'
    if offset == 1:
        return 'C'
    if offset == 2:
        return 'G'
    if offset == 3:
        return 'T'

def patternToNumber(pattern):
    if len(pattern) == 0:
        return 0
    symbol = pattern[len(pattern)-2]
    print(symbol)
    prefix = pattern[:-1]
    print(prefix)
    offset = baseToOffset(symbol)
    print(offset)
    return (4 * patternToNumber(prefix) + offset)

def numberToPattern(index, k):
    if k == 1:
        return offsetToBase(index)
    else:
        prefixIndex = index // 4
        remainder = index % 4
        symbol = offsetToBase(remainder)
        prefixPattern = numberToPattern(prefixIndex, (k-1))
        return prefixPattern + symbol

def computeFreqArray(text, k):
    freqArray = [0] * (4**k)
    loopLimit = len(text) - k
    for i in range(0, loopLimit):
        pattern = text[i:(i+k+1)]
        index = patternToNumber(pattern)
        freqArray[index] += 1
    return freqArray

def recFreqWords(text, k):
    freqPatterns = []
    freqArray = computeFreqArray(text, k)
    maxCount = max(freqArray)
    loopLimit = (4**k) - 1
    for i in range(0, loopLimit):
        if freqArray[i] == maxCount:
            pattern = numberToPattern(i, k)
            freqPatterns.append(pattern)
    return freqPatterns
Barmar
  • 741,623
  • 53
  • 500
  • 612
Juhi C.
  • 13
  • 2
  • 2
    The context in which you run this makes me think that speed will be important. Is there a reason you are using recursion here? I expect it to be significantly slower than equivalent loops. – Cireo Apr 06 '21 at 00:11
  • This was for a class, so it was mostly a question set to illustrate a type of problem rather than articulating the most efficient solution – Juhi C. Apr 23 '21 at 04:36

1 Answers1

2

I think your problem can instead be boiled down to

collections.Counter(windows(dna, 4))

where windows is a function of your choice, perhaps a performant one like in How to implement a sliding window over the string in Python to have each new window constructed in O(1), or perhaps a simple one like

def windows(items, size):
    return (items[i:i + size] for i in range(len(items) - size))

e.g.

dna = 'ACGTTGCATGTCGCATGATGCATGAGAGCT'
collections.Counter(windows(dna, 4))

would give

Counter({'GCAT': 3,
         'CATG': 3,
         'TGCA': 2,
         'ATGA': 2,
         'ACGT': 1,
         'CGTT': 1,
         'GTTG': 1,
         'TTGC': 1,
         'ATGT': 1,
         'TGTC': 1,
         'GTCG': 1,
         'TCGC': 1,
         'CGCA': 1,
         'TGAT': 1,
         'GATG': 1,
         'ATGC': 1,
         'TGAG': 1,
         'GAGA': 1,
         'AGAG': 1,
         'GAGC': 1})
Cireo
  • 4,197
  • 1
  • 19
  • 24