Hi Py-guys :). I am new to coding and to Python, so I don’t have much experience and any help would be appreciated. I am working with short tandem repeats in DNA sequences, and I would like code that reads a sequence and counts the repeated nucleotide units according to the tandem motif specified for each locus.
Here is an example of what I need:
tandem motif:
AGAT,AGAC,[AGAT],gat,[AGAT]
input:
TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT
analyzed input:
TTAGTTCAGGATAGTAGTTGTTTGGAAGCGCAACTCTCTGAGAAACTTAGTTATTCTCTCATCTATTTAGCTACAGCAAACTTCATGTGACAAAAGCCACACCCATAACTTTTTTCCTCTAGATAGACAGATAGATGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGATTCTCTTTCTCTGCATTCTCATCTATATTTCTGTCTTTCTCTTAATTATGGGTAACTCTTAGCCTGCCAGGCTACCATGGAAAGACAACCTTTAT
output:
AGAT AGAC (AGAT)2 GAT (AGAT)12
- the number after each bracket is the number of copies. (In the output, GAT is written in upper case even though it is not counted; see Description.)
allele: 16
- the sum of the copies of all counted motifs (1 + 1 + 2 + 12)
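The output line and the allele value can both be derived from a list of per-block counts. Here is a minimal sketch of that last step; the `(unit, copies, counted)` tuple layout and the function names are hypothetical, chosen only for illustration, and it assumes that a block with a single copy is printed without brackets.

```python
# Hypothetical representation: (repeat unit, copies found, counts toward allele?)
blocks = [
    ("AGAT", 1, True),
    ("AGAC", 1, True),
    ("AGAT", 2, True),
    ("GAT", 1, False),   # written in lower case in the motif, so not counted
    ("AGAT", 12, True),
]


def format_blocks(blocks):
    """Render blocks in the style 'AGAT AGAC (AGAT)2 GAT (AGAT)12'."""
    parts = []
    for unit, copies, _counted in blocks:
        # Assumption: a single copy is shown bare, several copies as (UNIT)n.
        parts.append(unit if copies == 1 else f"({unit}){copies}")
    return " ".join(parts)


def allele(blocks):
    """Sum the copies of every block that counts toward the allele."""
    return sum(copies for _unit, copies, counted in blocks if counted)


print(format_blocks(blocks))       # AGAT AGAC (AGAT)2 GAT (AGAT)12
print("allele:", allele(blocks))   # allele: 16
```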
Description
The tandem motif is different for each locus, so I need to specify it manually for each and every locus (about 130 loci in total).
So in this case the whole motif begins with AGAT and ends with the last copy of AGAT.
There are no unspecified nucleotides (A/C/G/T) between the units specified in the tandem motif, and everything before and after the defined motif should be ignored.
As you can see, when nucleotides in the tandem motif are written in lower case (gat), they are not included in the final allele value.
Motifs in brackets can repeat multiple times.
Motifs without brackets have only one copy in the sequence.
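These rules map fairly directly onto a regular expression. Below is a hand-written sketch for this one motif using Python's standard `re` module: each bracketed unit becomes a group that may repeat, and the units without brackets are matched literally. The function name and the demo sequence are made up for illustration, and it assumes a bracketed unit occurs at least once (`+`); if zero copies are possible, `*` would be needed instead.

```python
import re

# Hand-written pattern for the motif AGAT,AGAC,[AGAT],gat,[AGAT].
# Bracketed units become capturing groups that repeat one or more times.
PATTERN = re.compile(r"AGATAGAC((?:AGAT)+)GAT((?:AGAT)+)")


def count_example_locus(sequence):
    """Return (first_block_copies, second_block_copies, allele) or None."""
    match = PATTERN.search(sequence.upper())
    if match is None:
        return None
    first = len(match.group(1)) // 4    # each AGAT unit is 4 nucleotides long
    second = len(match.group(2)) // 4
    # AGAT and AGAC count once each; the lower-case gat is not counted.
    return first, second, 1 + 1 + first + second


# Synthetic sequence built to mirror the example (the flanks are arbitrary).
demo = "CCTCT" + "AGAT" + "AGAC" + "AGAT" * 2 + "GAT" + "AGAT" * 12 + "ATAGATTCTC"
print(count_example_locus(demo))  # (2, 12, 16)
```

Because `search` scans the whole sequence, everything before and after the motif is ignored automatically.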
There can also be this case:
tandem motif:
[CTAT],CTAA,[CTAT],N30,[TATC]
input:
TTTGCATGATCTCTTCTTGATCATTTTCTTCCCCCTTTCCTAAAAAATTCTGGTCCTTTGAGGTAACTGCCATTACCATATGAGTTAGTCTGGGTTCTCCAGAGAAACAGAACCAATAGGCTATCTATCTAACTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTACTATCTCTATATTATCTATCTATCTATTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTATCTATATCTTCTACCAAGTGATTTACTGTAATAAATTAGCTCATGCTATTATGGAGGATGAGTTCAAGATTTGTGGTCAGCAAGTTGCAGACTCA
analyzed input:
TTTGCATGATCTCTTCTTGATCATTTTCTTCCCCCTTTCCTAAAAAATTCTGGTCCTTTGAGGTAACTGCCATTACCATATGAGTTAGTCTGGGTTCTCCAGAGAAACAGAACCAATAGGCTATCTATCTAACTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTACTATCTCTATATTATCTATCTATCTATTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTATCTATATCTTCTACCAAGTGATTTACTGTAATAAATTAGCTCATGCTATTATGGAGGATGAGTTCAAGATTTGTGGTCAGCAAGTTGCAGACTCA
output:
(CTAT)2 CTAA (CTAT)12 (TATC)13
allele: 28
- (2+1+12+13)
Description
N30 means that there are 30 unspecified nucleotides before the final tandem repeat; they are not shown in the output and are not counted in the allele.
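In regular-expression terms, a spacer such as N30 could be written as `[ACGT]{30}`. The sketch below assembles the pattern for this motif by hand; `spacer_pattern` is a hypothetical helper name, the demo sequence is synthetic, and it assumes the sequence contains only A, C, G and T (no ambiguity codes).

```python
import re


def spacer_pattern(token):
    """Turn a spacer token such as 'N30' into a regex fragment '[ACGT]{30}'."""
    match = re.fullmatch(r"N(\d+)", token)
    if match is None:
        raise ValueError(f"not a spacer token: {token!r}")
    return "[ACGT]{%d}" % int(match.group(1))


# Hand-assembled pattern for [CTAT],CTAA,[CTAT],N30,[TATC]
pattern = re.compile(
    "((?:CTAT)+)" + "CTAA" + "((?:CTAT)+)" + spacer_pattern("N30") + "((?:TATC)+)"
)

# Synthetic sequence: arbitrary flanks and an arbitrary 30-nucleotide spacer.
demo = "AA" + "CTAT" * 2 + "CTAA" + "CTAT" * 12 + "G" * 30 + "TATC" * 13 + "GG"
m = pattern.search(demo)
copies = [len(group) // 4 for group in m.groups()]  # every unit here is 4 nt long
print(copies)                       # [2, 12, 13]
print("allele:", sum(copies) + 1)   # allele: 28 (the +1 is the single CTAA)
```

The spacer has no capturing group, so it contributes nothing to the counts.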
Summary
The motifs can contain the following types of elements, which all need to be supported; each locus has a different combination of them:
Brackets: example [CTAT] – multiple copies of CTAT
No brackets: example CTAT – only one copy of CTAT
N#: example N30 – 30 unspecified nucleotides (A/C/G/T)
Lower case: example ctat – these nucleotides are not included in the final allele number
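The four element types above can be handled by one general routine that translates a motif string into a regular expression. This is only a sketch under stated assumptions: the names `compile_motif` and `analyze` are hypothetical, bracketed units are assumed to occur at least once, mixed-case units are treated as not counted, and the leftmost match in the sequence is taken as the locus.

```python
import re


def compile_motif(spec):
    """Translate e.g. 'AGAT,[AGAT],gat,N30' into (compiled regex, tokens).

    Each token is (kind, unit, counted) with kind in
    {'repeat', 'single', 'spacer'}.
    """
    parts, tokens = [], []
    for raw in spec.split(","):
        raw = raw.strip()
        spacer = re.fullmatch(r"N(\d+)", raw)
        if spacer:
            parts.append("[ACGT]{%s}" % spacer.group(1))
            tokens.append(("spacer", raw, False))
            continue
        bracketed = raw.startswith("[") and raw.endswith("]")
        unit = raw[1:-1] if bracketed else raw
        if not re.fullmatch(r"[ACGTacgt]+", unit):
            raise ValueError(f"unrecognised motif token: {raw!r}")
        counted = unit.isupper()   # lower case = excluded from the allele
        unit = unit.upper()
        if bracketed:
            parts.append("((?:%s)+)" % unit)   # one or more copies
            tokens.append(("repeat", unit, counted))
        else:
            parts.append("(%s)" % unit)        # exactly one copy
            tokens.append(("single", unit, counted))
    return re.compile("".join(parts)), tokens


def analyze(spec, sequence):
    """Return (output string, allele) or None if the motif is not found."""
    regex, tokens = compile_motif(spec)
    match = regex.search(sequence.upper())
    if match is None:
        return None
    groups = iter(match.groups())
    shown, allele = [], 0
    for kind, unit, counted in tokens:
        if kind == "spacer":
            continue   # spacers have no group and are not reported
        copies = len(next(groups)) // len(unit)
        shown.append(f"({unit}){copies}" if kind == "repeat" else unit)
        if counted:
            allele += copies
    return " ".join(shown), allele


# Synthetic sequence mirroring the first example (flanks are arbitrary).
demo = "CCTCT" + "AGAT" + "AGAC" + "AGAT" * 2 + "GAT" + "AGAT" * 12 + "ATAGATTCTC"
print(analyze("AGAT,AGAC,[AGAT],gat,[AGAT]", demo))
# ('AGAT AGAC (AGAT)2 GAT (AGAT)12', 16)
```

Regex quantifiers are greedy, so each bracketed block takes as many copies as it can before the next element is tried. If a flanking region happens to contain a shorter copy of the same motif, the leftmost match may not be the intended one, so results on real reads would need checking.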
Examples of real motifs:
[CTTT],TT,CT,[CTTT]
[TCTA],[TCTG],[TCTA],ta,[TCTA],tca,[TCTA],tccata,[TCTA],TA,[TCTA]
[TAGA],[CAGA],N48,[TAGA],[CAGA]
[AAGA],[AAGG],[AAGA]
and many more…
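With about 130 loci, the motifs could be kept in one table and checked for typos before any sequence is analyzed. A minimal sketch follows; the locus names are placeholders (the real names are not given here), and the validation rule assumes every element is a bracketed upper-case unit, a bare unit that is all upper or all lower case, or N followed by a number.

```python
import re

# Placeholder locus names; the motif strings are the ones listed above.
LOCUS_MOTIFS = {
    "locus_A": "[CTTT],TT,CT,[CTTT]",
    "locus_B": "[TCTA],[TCTG],[TCTA],ta,[TCTA],tca,[TCTA],tccata,[TCTA],TA,[TCTA]",
    "locus_C": "[TAGA],[CAGA],N48,[TAGA],[CAGA]",
    "locus_D": "[AAGA],[AAGG],[AAGA]",
}

# One element: a bracketed unit, a bare unit (upper or lower case), or N<number>.
TOKEN = re.compile(r"\[[ACGT]+\]|[ACGT]+|[acgt]+|N\d+")


def invalid_tokens(spec):
    """Return the elements of a motif string that do not fit any known type."""
    return [tok for tok in spec.split(",") if not TOKEN.fullmatch(tok.strip())]


for name, spec in LOCUS_MOTIFS.items():
    bad = invalid_tokens(spec)
    print(name, "OK" if not bad else f"invalid elements: {bad}")
```

The same dictionary could later be loaded from a text or CSV file, so the motifs are maintained outside the code.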
Thank you all in advance. Any help and ideas would be appreciated! :)