Analyze tandem repeat motifs in DNA sequences

Question

Hy Py-guys :). Since I am new in the coding world and as well in Python, I don’t have much experience with coding and thus any help would be appreciated. I am working with short tandem repeats in DNA sequences and I would like to have a code that reads and counts the repeated nucleotides based on the tandem motif of specified loci.

Here is an example what I need:

tandem motif:

AGAT,AGAC,[AGAT],gat,[AGAT]

input:

TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT

analyzed input:

TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT

output:

AGAT AGAC (AGAT)2 GAT (AGAT)12

number of copies. (In output GAT is in upper case even if it doesn’t count viz Description)

allele: 16

total number of copies of each motif (1 + 1 + 2 + 12)

Description

That tandem motif is different for each locus so I need to manually specify it for one and every locus (about 130 loci in total).

So in this case whole motif begins with AGAT and end with the last copy of AGAT

There is no unknown nucleotide (A/C/T/G) between those specified in tandem motif and everything what is before and after this defined motif should be ignored

As you can see, when in tandem motif there are nucleotides written in lower case (gat), they are not included in the final allele value

Those motifs that are in brackets, these can repeat multiple times

Those that are not in brackets – these have only one copy in the sequence

There can also be this case:

tandem motif:

[CTAT],CTAA,[CTAT],N30,[TATC]

input:

TTTGCATGATCTCTTCTTGATCATTTTCTTCCCCCTTTCCTAAAAAATTCTGGTCCTTTGAGGTAACTGCCATTACCATATGAGTTAGTCTGGGTTCTCCAGAGAAACAGAACCAATAGGCTATCTATCTAACTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTACTATCTCTATATTATCTATCTATCTATTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTATCTATATCTTCTACCAAGTGATTTACTGTAATAAATTAGCTCATGCTATTATGGAGGATGAGTTCAAGATTTGTGGTCAGCAAGTTGCAGACTCA

analyzed input:

TTTGCATGATCTCTTCTTGATCATTTTCTTCCCCCTTTCCTAAAAAATTCTGGTCCTTTGAGGTAACTGCCATTACCATATGAGTTAGTCTGGGTTCTCCAGAGAAACAGAACCAATAGGCTATCTATCTAACTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTACTATCTCTATATTATCTATCTATCTATTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTATCTATATCTTCTACCAAGTGATTTACTGTAATAAATTAGCTCATGCTATTATGGAGGATGAGTTCAAGATTTGTGGTCAGCAAGTTGCAGACTCA

output:

(CTAT)2 CTAA (CTAT)12 (TATC)13

allele: 28

(2+1+12+13)

Description

N30 means, that there are 30 unspecified nucleotides before final tandem repeat

Summary

There can be these types in motifs, which need to be defined, and each locus would have different combination of motifs:

Brackets: example [CTAT] – multiple copies of CTAT

No brackets: example CTAT – only one copy of CTAT

N#: example N30 - means 30 unspecified nucleotides (A/C/G/T)

Lower case: example ctat - means that these are not included in final allele number

Examples of real motifs:

[CTTT],TT,CT,[CTTT]

[TCTA],[TCTG],[TCTA],ta,[TCTA],tca,[TCTA],tccata,[TCTA],TA,[TCTA]

[TAGA],[CAGA],N48,[TAGA],[CAGA]

[AAGA],[AAGG],[AAGA]

and many more…

Thank you all in advance. Any help and ideas would be appreciated! :)

You need to use the formatting that you specified in the `tandem motif` (ex.: you're reading them from somewhere else) or you can change the formatting? you're using python3? — Gsk, Jul 11 '18 at 08:10
It can be changed. It is usually used as I presented, but for coding it can be modified if necessary :). Yes I have Python3. — Majkl, Jul 11 '18 at 08:19
Yes every "item" in motif counts as 1, except those in lower case, those with N## and those that add 0,1 - 0,3 these would be the most problematic I guess. Usually they have 4 letters but there are also cases, which have 3 letters and 5 letters. So sometimes 3-lettered motifs counts as 1 and sometimes they add 0,3 :/ — Majkl, Jul 11 '18 at 08:48
..... to make it more simple: when motifs have 4 lettered "items" that counts as 1, then if there is also 3-lettered item in this motif then it counts as 0,3. (example: [TCTA],TCA,[TCTA]) But when motif has 3-lettered "items" then these count as 1 (example: [CCT],CTT,[TCT],CCT,[TCT]). I hope it makes sense :D — Majkl, Jul 11 '18 at 08:54
I deleted the "0.1, 0.2 etc." problem because it is probably more problematic than I thought at first. — Majkl, Jul 12 '18 at 07:43
If the evaluation has precise rules and you can explain them properly in a new question, on SO you'll find someone able to help you. — Gsk, Jul 12 '18 at 08:30

score 2 · Accepted Answer · answered Jul 11 '18 at 10:14

A good way to solve your problem is to use regex. REGular EXpression are a common way in programming to parse strings.
With regex you can define which pattern you want to search in a string (almost like you did), which is the core of your problem.
This means that regex have their own formatting, similar, but not identical, to yours.
You can also write some code to convert your formatting to regex formatting, but you probably should write another question, avoiding all the DNA stuff.

Lets see how regex works:
Here is how your summary looks like in regex pattern:

Summary

There can be these types in motifs, which need to be defined, and each locus would have different combination of motifs:

Brackets: example [CTAT] – multiple copies of CTAT - RegEx: (CTAT)+ (one or more) or (CTAT)* (zero or more)

No brackets: example CTAT – only one copy of CTAT - RegEx: (CTAT){1}

N#: example N30 - means 30 unspecified nucleotides (A/C/G/T) - RegEx: .{30}

Lower case: example ctat - means that these are not included in final allele number - RegEx: (?:CTAT)

With this knowledge, we can apply our regex to our inputs:
example 1:

import re # import regex module

tandem = r"((AGAT){1}(AGAC){1}(AGAT)+(?:GAT){1}(AGAT)+)"

mystring = "TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT" #input string

analyzed_input = re.findall(tandem, mystring)[0]

print(analyzed_input) #see the match found

tot = 0
max_len = max((len(al) for al in analyzed_input[1:] if len(al) <= 4)) # longest allele, maximum 4
remaining_string = analyzed_input[0] #string to analyzed. will be cutted in for loop
for allele in analyzed_input[1:]: #for each allele
    section = re.findall(r"((" + re.escape(allele) + ")+)", remaining_string)[0][0] # section where the allele is repeated
    value = section.count(allele) if len(allele) == max_len else section.count(allele)*(len(allele)/10.0) # get the value of the alleles. /10.0 if allele is shorter than the longest allele found
    remaining_string = remaining_string[remaining_string.index(section)+len(section):] # cut away from remaining string the current section
    print("The value of allele {0} is {1}\n".format(allele, value))
    if len(allele) <= 4: #add the allele value if his length is less than 5
        tot += value

print("total allele number is: {0}".format(tot))

OUTPUT: total allele number is: 16

for the next examples, I'm only showing the regex tandem, the rest of the code is the same

example 2:

tandem2 = r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA)+(TCTA)+)"

OUTPUT: total allele number is: 32.2

example 3:

tandem3 = r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA)*(TCTA)*)"

OUTPUT: total allele number is: 31.0

example 4:

tandem4 = r"((CTAT)+(CTAA){1}(CTAT)+(.{30})(TATC)+)"

OUTPUT: total allele number is: 28.0

your other example will be written as:

[CTTT],TT,CT,[CTTT] r"((CTTT)+(TT){1}(CT){1}(CTTT)+)"

[TCTA],[TCTG],[TCTA],ta,[TCTA],tca,[TCTA],tccata,[TCTA],TA,[TCTA] r"((TCTA)+(TCTG)+(TCTA)+(?:TA){1}(TCTA)+(?:TCA){1}(TCTA)+(?:TCCATA){1}(TCTA)+(TA){1}(TCTA)+)"

[TAGA],[CAGA],N48,[TAGA],[CAGA] r"((TAGA)+(CAGA)+(.{48})(TAGA)+(CAGA)+)"

[AAGA],[AAGG],[AAGA] r"((AAGA)+(AAGG)+(AAGA)+)"

Developing a full working framework will take a little bit of time, depending on what level of flexibility you want to achieve, input type, automation...

BTW well written question, even if there are at least 3 different problems to solve (usually you should have 1 problem well identified for each question). see [mcve] — Gsk, Jul 11 '18 at 10:15
Wow, this looks great, Thanks. So I assume that I could maybe create something like a dictionary or maybe tupple, where I would define all the loci names with their regex-ed motifs? So I could then apply this dictionary to raw input data and then analyze it? — Majkl, Jul 11 '18 at 10:39
Yes. If I was developing that part, I'd probably write a little function where `[ATAC]` gets sobstituted with `(ATAC)+)`, `ata` gets sobstituted with `(?:ATA)`, `N42` gets sobstituted with `(.{42})` and so on. Remember that you can also concatenate RegEx strings using `+` sign. I suggest you to develop some code and ask when you get stuck. Notify me if you want, since I already know your requirements — Gsk, Jul 11 '18 at 12:37

score 1 · Answer 2 · answered Jul 11 '18 at 08:31

I am sorry that I do not have the time to finish all cases, but hopefully this will give you something to start with.

The design is a linear automaton whose tape is the sequence of nucleotides.

We have a position (pos) variable that marks the index in the sequence that we are working from.

There are also two running accumulating variables: an output string and an integer count of alleles.

Now we have the setup initialised, we can start our iteration over each of the motifs in the tandem motif string. This is done by splitting the string on commas.

Then inside the for-loop, we need to decide which motif case this is (e.g. square bracket repeating, no brackets, N# etc.). For the sake of time, I have only implemented this for the repeating square brackets as it demonstrates the process easily.

Once a test case has been passed, you need to then handle what steps need to be done.

For instance, in this case with the square brackets, the motif is repeating so I initialised an initial count variable to 0 and then jumped pos to the first occurrence of the motif in sequence if pos was 0 - i.e., if this is our first motif, we need to jump our position to the end of the first occurrence. I also increment the count as we have found one motif.

From here, whilst the next characters in sequence are equal to our motif string, we increment the position by the length of the motif (so it is at the end for the next one) and also increment the count.

Finally, we append the formatted string ((motif)#) to the output string and add the number of motifs (alleles) to the master alleles counter.

We then return our output as a dictionary (you could use a tuple if you wanted).

def dna(tandem_motif, sequence):
    pos = 0
    output = ''
    alleles = 0
    for motif in tandem_motif.split(','):
        if motif[0] == '[' and motif[-1] == ']':
            motif = motif.replace('[', ''). replace(']', '')
            count = 0
            if pos == 0:
                pos = sequence.index(motif) + len(motif)
                count += 1
            while sequence[pos:pos+len(motif)] == motif:
                pos += len(motif)
                count += 1
            output += '({}){}'.format(motif, count)
            alleles += count
        #elif ... :    <-- where you add the criteria for the other motif test cases
    return {'alleles': alleles, 'output': output}

and a test with a very basic case I made up:

>>> dna('[TCTA]', 'TGCAGCATTCTATCTATCTAGCTAAGCC')
{'alleles': 3, 'output': '(TCTA)3'}

which is correct since: 'TGCAGCAT|TCTATCTATCTA|GCTAAGCC'

Analyze tandem repeat motifs in DNA sequences

2 Answers2

Linked