I have a script that returns palindrome substrings in a DNA sequence.

sequence="GATCTCTATACCAACTCAAAATGAAGACTCTTCTTTACACTTTCGAGCTCAGCAGGCTTACCGAGAAGAGTCGTCGTTCACATCCCCCCCTGTGCGAGATCAAGAAATTTGGCGACGTCGGCTTATTATCCTCCGCTGTCAATCAGTTGGACACATCTCTCCGGTCACTGCCGGACAAGCCAACCGAAGATTCGATTCTTCAGCAGCTTATCGACATTGCTGGTGGTGAAAAGCCAAGGCACAGCATCATAGTTGCGACCAATACGTCATACGACCGAGAGACATTGGTAAAGATCCTTCAACGATTCCCATACACCATACCTGGTCTGTCAGATTCAGGCTTGGAATCAGAAACACTCGAGGCTCTTGAGCACATCGCTTTTGCATTAGCCGGGCGATTAGCTCATAGATTTGACTACGGGTTCAATCCAGAGGCCAGTATCGTTCAACACCTCGAGATGTTCACCACCCTTTGGCACCAAAGATCTGCATTACCACCTGCGCCTGCCCCGTATCGACTTCCCGTTCCCGTCAATCAAGGAAGAGTCTCCTCATCAGATGATGGCTCTGATACTGAGTCAGAACTGGATGAAAAATACCACAACATCAAGAAGTCAGGACTTTGGAGGTTTCTGGATATGTTCAAAATGAACTTCAAGAGGTCTTAGATAACGGTCTAGTTCTAGTTCTGCAACTCACACTGA"
print(len(sequence))
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
for i in range(len(sequence) - 6 + 1):
    pal = True
    for j in range(2):
        if pairs[ sequence[i+j] ] != sequence[i+5-j]:
            pal = False
            break
    if pal:
        print(sequence[i : i+6])

It returns:

704
GATCTC
GAGCTC
GCAGGC
GTTCAC
GAGATC
TCAAGA
AAATTT
GACGTC
CAGTTG
TGGACA
AAGATT
CTTCAG
CCAAGG
CGACCG
TTGGAA
CTCGAG
TCTTGA
CTTGAG
TGAGCA
CGGGCG
ATAGAT
ACGGGT
TCCAGA
CTCGAG
TCGAGA
TGTTCA
GTTCAC
GGCACC
AGATCT
CACCTG
GCCTGC
GACTTC
CAGATG
AGAACT
TCAAGA
GAAGTC
TCAGGA
AGGACT
TCTGGA
TGTTCA
TTCAAA
TCAAGA
GAGGTC
AGGTCT
TAGATA
AGTTCT
AGTTCT

I want to find out whether these substrings are positioned next to "[ATCG]CC" or "[ATCG]GG". My idea is to find the position of each palindrome in the sequence (for example from the i-th to the (i+5)-th letter, as the palindromes are of length 6) and then check whether the (i+6)-th to (i+8)-th letters are [ATCG]CC or [ATCG]GG. Do you know how I can write such a script? Or do you have a better approach in mind? Thank you.

Debutant
  • Does this answer your question? [How to check for palindrome using Python logic](https://stackoverflow.com/questions/17331290/how-to-check-for-palindrome-using-python-logic) – Leonardus Chen Dec 16 '20 at 06:13
  • @LeonardusChen I have found the palindromes, but I need to choose among these palindromes those that are placed next to [TACG]CC or [TACG]GG – Debutant Dec 16 '20 at 06:18
  • Can you elaborate on what "palindrome" means in this context? `AGGTCT` doesn't seem to be a palindrome, yet it is in the list. Am I missing something? – Akshay Sehgal Dec 16 '20 at 06:18
  • It seems like you'll need to record the position of the palindrome, not just what it was. – Ouroborus Dec 16 '20 at 06:23
  • @AkshaySehgal it's a DNA palindrome. The matching sequence on the other strand is the same if read backwards. – Debutant Dec 16 '20 at 07:10
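
The definition in the comment above (a sequence that equals its own reverse complement) can be sketched as below. The names `reverse_complement` and `is_dna_palindrome` are illustrative and not part of the original script. Note that the loop in the question uses `range(2)`, so it compares only the two outermost pairs of each 6-letter window; that is why `AGGTCT` passes there but fails this stricter check.

```python
# Translation table mapping each base to its complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def is_dna_palindrome(seq):
    """True if seq reads the same as its reverse complement."""
    return seq == reverse_complement(seq)

print(is_dna_palindrome("GAATTC"))  # True
print(is_dna_palindrome("AGGTCT"))  # False
```

Whether the looser two-pair test or the full test is wanted depends on the use case; the sketch only shows what the full test looks like.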

2 Answers

I'm not sure I've understood your question correctly, but assuming the values you've found are DNA palindromes and you want to check the three letters that follow each one (correct me if I got it wrong), a simple solution would be something like this:

sequence="GATCTCTATACCAACTCAAAATGAAGACTCTTCTTTACACTTTCGAGCTCAGCAGGCTTACCGAGAAGAGTCGTCGTTCACATCCCCCCCTGTGCGAGATCAAGAAATTTGGCGACGTCGGCTTATTATCCTCCGCTGTCAATCAGTTGGACACATCTCTCCGGTCACTGCCGGACAAGCCAACCGAAGATTCGATTCTTCAGCAGCTTATCGACATTGCTGGTGGTGAAAAGCCAAGGCACAGCATCATAGTTGCGACCAATACGTCATACGACCGAGAGACATTGGTAAAGATCCTTCAACGATTCCCATACACCATACCTGGTCTGTCAGATTCAGGCTTGGAATCAGAAACACTCGAGGCTCTTGAGCACATCGCTTTTGCATTAGCCGGGCGATTAGCTCATAGATTTGACTACGGGTTCAATCCAGAGGCCAGTATCGTTCAACACCTCGAGATGTTCACCACCCTTTGGCACCAAAGATCTGCATTACCACCTGCGCCTGCCCCGTATCGACTTCCCGTTCCCGTCAATCAAGGAAGAGTCTCCTCATCAGATGATGGCTCTGATACTGAGTCAGAACTGGATGAAAAATACCACAACATCAAGAAGTCAGGACTTTGGAGGTTTCTGGATATGTTCAAAATGAACTTCAAGAGGTCTTAGATAACGGTCTAGTTCTAGTTCTGCAACTCACACTGA"

pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}

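# Collect every candidate together with its (start, end) position.
keeper = []
for i in range(len(sequence) - 6 + 1):
    pal = True
    for j in range(2):  # compares the two outermost base pairs
        if pairs[sequence[i+j]] != sequence[i+5-j]:
            pal = False
            break
    if pal:
        the_sequence = sequence[i : i+6]
        keeper.append((the_sequence, (i, i+6)))

# The eight allowed three-letter suffixes: [ATCG]CC and [ATCG]GG.
possible_ends = [a+'CC' for a in "ATCG"]
possible_ends.extend([a+'GG' for a in "ATCG"])

final = []

for val in keeper:
    # The palindrome plus the three letters that follow it.
    temp = val[0] + sequence[val[1][1]:val[1][1]+3]

    # Require all 9 letters so that a palindrome at the very end of the
    # sequence cannot match on its own last letters.
    if len(temp) == 9 and any(temp.endswith(a) for a in possible_ends):
        final.append(temp)

print(final)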

Output:

['GCCTGCCCC', 'GAAGTCAGG']

I hope and believe this is the desired output.
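
As an alternative to building the list of eight endings, the three letters after a palindrome could be tested with a regular expression. This is only a sketch: `TAIL` and `has_motif_after` are hypothetical names, and `start` is assumed to be the index where a 6-letter palindrome begins.

```python
import re

# Any base followed by CC or GG, i.e. [ATCG]CC or [ATCG]GG.
TAIL = re.compile(r"[ATCG](?:CC|GG)")

def has_motif_after(sequence, start, pal_len=6):
    """Check the three letters right after the window starting at `start`."""
    tail = sequence[start + pal_len : start + pal_len + 3]
    # fullmatch fails on a tail shorter than three letters, so a window
    # at the very end of the sequence is rejected.
    return TAIL.fullmatch(tail) is not None

print(has_motif_after("GAATTCACC", 0))  # True
print(has_motif_after("GAATTCACG", 0))  # False
print(has_motif_after("CCCGGG", 0))     # False: nothing follows the window
```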

Amit Amola
  • Thank you for your answer. What I am looking for is for the script to find if the GATCTC that it finds is placed next to ACC, TCC, GCC, CCC, AGG, TGG, CGG or GGG, then print those 9 letters (e.g. GATCTCTCC) – Debutant Dec 16 '20 at 07:37
  • So let me clarify: for each palindrome you've found, you just want the next three letters after it? That means GATCTC -> GATCTCTCC, GAGCTC -> GAGCTCAGC, and so on? Is that all, or do you want only specific ones, or only those like GATCTC? I am still trying to understand your question. – Amit Amola Dec 16 '20 at 07:42
  • For example, for GATCTC I only want it if it's followed by [ATCG]CC or [ATCG]GG, meaning GATCTCACC or GATCTCTCC or GATCTCGCC or GATCTCCCC and so on. – Debutant Dec 16 '20 at 07:59
  • Ohhh, I think I understood the question now. Let me try again. – Amit Amola Dec 16 '20 at 08:02
  • Can you check the new solution? Is this what you were looking for? If this is the desired result, optimizing the code is the next thing to try: if it is going to run on very long sequences, there may be faster and more efficient approaches for this kind of data. – Amit Amola Dec 16 '20 at 08:10

Just add some extra checks.

sequence="GATCTCTATACCAACTCAAAATGAAGACTCTTCTTTACACTTTCGAGCTCAGCAGGCTTACCGAGAAGAGTCGTCGTTCACATCCCCCCCTGTGCGAGATCAAGAAATTTGGCGACGTCGGCTTATTATCCTCCGCTGTCAATCAGTTGGACACATCTCTCCGGTCACTGCCGGACAAGCCAACCGAAGATTCGATTCTTCAGCAGCTTATCGACATTGCTGGTGGTGAAAAGCCAAGGCACAGCATCATAGTTGCGACCAATACGTCATACGACCGAGAGACATTGGTAAAGATCCTTCAACGATTCCCATACACCATACCTGGTCTGTCAGATTCAGGCTTGGAATCAGAAACACTCGAGGCTCTTGAGCACATCGCTTTTGCATTAGCCGGGCGATTAGCTCATAGATTTGACTACGGGTTCAATCCAGAGGCCAGTATCGTTCAACACCTCGAGATGTTCACCACCCTTTGGCACCAAAGATCTGCATTACCACCTGCGCCTGCCCCGTATCGACTTCCCGTTCCCGTCAATCAAGGAAGAGTCTCCTCATCAGATGATGGCTCTGATACTGAGTCAGAACTGGATGAAAAATACCACAACATCAAGAAGTCAGGACTTTGGAGGTTTCTGGATATGTTCAAAATGAACTTCAAGAGGTCTTAGATAACGGTCTAGTTCTAGTTCTGCAACTCACACTGA"
print(len(sequence))
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
ans = []
for i in range(len(sequence) - 9 + 1):
    pal = True
    for j in range(2):
        if pairs[ sequence[i+j] ] != sequence[i+5-j]:
            pal = False
            break
    if not pal:
        continue

    # sequence[i+6] may be any base; the two letters after it must both be C or both be G.
    if (sequence[i+7] == sequence[i+8]) and (sequence[i+7] in ('C', 'G')):
        print(sequence[i : i+9])
        ans.append(sequence[i : i+9])
    else:
        print(sequence[i : i+6] + " (X)")
print("Count of answer: %d" % len(ans))

Output:

704
GATCTC (X)
GAGCTC (X)
GCAGGC (X)
GTTCAC (X)
GAGATC (X)
TCAAGA (X)
AAATTT (X)
GACGTC (X)
CAGTTG (X)
TGGACA (X)
AAGATT (X)
CTTCAG (X)
CCAAGG (X)
CGACCG (X)
TTGGAA (X)
CTCGAG (X)
TCTTGA (X)
CTTGAG (X)
TGAGCA (X)
CGGGCG (X)
ATAGAT (X)
ACGGGT (X)
TCCAGA (X)
CTCGAG (X)
TCGAGA (X)
TGTTCA (X)
GTTCAC (X)
GGCACC (X)
AGATCT (X)
CACCTG (X)
GCCTGCCCC
GACTTC (X)
CAGATG (X)
AGAACT (X)
TCAAGA (X)
GAAGTCAGG
TCAGGA (X)
AGGACT (X)
TCTGGA (X)
TGTTCA (X)
TTCAAA (X)
TCAAGA (X)
GAGGTC (X)
AGGTCT (X)
TAGATA (X)
AGTTCT (X)
AGTTCT (X)
Count of answer: 2
AnnieFromTaiwan
  • Thank you so much. Yes that is what I wanted. – Debutant Dec 16 '20 at 08:22
  • Is it possible to print the number of substrings that match the criteria after printing them? My sequence may get longer, which would make them harder to track visually. – Debutant Dec 16 '20 at 08:34
  • @Debutant Sure, you can achieve that by just adding 2 lines of code. See my newest edit. – AnnieFromTaiwan Dec 16 '20 at 08:38
  • I tried this code for palindromes of length 18 and it doesn't return the same results for the last three letters. Do you know how I can fix this? – Debutant Dec 18 '20 at 19:24
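
The offsets in the code above (`i+5`, `i+7`, `i+8`, and the `9` in the loop bound) are specific to 6-letter palindromes, so a longer window needs them recomputed. A minimal sketch that derives them from a length parameter follows. `find_flanked_palindromes`, `pal_len` and `pairs_to_check` are hypothetical names, and checking every pair by default is an assumption; the thread's code checks only two pairs, which `pairs_to_check=2` mirrors. The sketch also assumes the sequence contains only A, T, C and G.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def find_flanked_palindromes(sequence, pal_len=6, pairs_to_check=None):
    """Return (start, palindrome + 3 letters) for windows followed by xCC or xGG."""
    if pairs_to_check is None:
        pairs_to_check = pal_len // 2  # assumption: compare every base pair
    hits = []
    # Stop early enough that three letters always follow the window.
    for i in range(len(sequence) - (pal_len + 3) + 1):
        last = i + pal_len - 1
        # Skip windows where any checked pair is not complementary.
        if any(PAIRS[sequence[i + j]] != sequence[last - j]
               for j in range(pairs_to_check)):
            continue
        tail = sequence[i + pal_len : i + pal_len + 3]
        # First tail letter may be any base; the next two must be CC or GG.
        if tail[1:] in ("CC", "GG"):
            hits.append((i, sequence[i : i + pal_len + 3]))
    return hits

print(find_flanked_palindromes("GAATTCTGG"))               # [(0, 'GAATTCTGG')]
print(find_flanked_palindromes("GGAATTCCACC", pal_len=8))  # [(0, 'GGAATTCCACC')]
print(len(find_flanked_palindromes("GATCTCTCC")))          # 0 with the full check
```

For length 18 the call would be `find_flanked_palindromes(sequence, pal_len=18)`, and `len()` of the result gives the count.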