
I am trying to use Python to modify a list of DNA sequences that will be sent out for synthesis. The first three and last three characters of each sequence need to be replaced, and a * needs to be inserted between every pair of adjacent characters.

The tricky part is that the very first and very last characters need a different replacement from the other two at each end.

For example: the DNA sequence TGTACGTTGCTCCGAC would need to be changed to /52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/

The first character needs to become /52MOEr_/ and the last needs to become /32MOEr_/, where _ is the character at that position. For the example above, that is T for the first and C for the last. The other two characters at each end (GT at the start and GA at the end) need the /i2MOEr_/ modification.

So far I have converted the sequences into a list using the .split() method. The result was ['AAGTCTGGTTAACCAT', 'AATACTAGGTAACTAC', 'TGTACGTTGCTCCGTC', 'TGTAGTTAGCTCCGTC']. I have been playing around with it for a bit, but I feel I need some guidance.

Is this not as easy to do as I thought it would be?

ajs

2 Answers


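You can split the problem into three parts: the first three characters, the middle, and the last three characters. Handle each part separately, then concatenate the results. Here's my solution to achieve your goal.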

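dna = "TGTACGTTGCTCCGAC"
# First three bases: /52MOEr_/ for the first, /i2MOEr_/ for the next two
dnaFirst3Chars = '/52MOEr' + dna[0] + '/*/i2MOEr' + dna[1] + '/*/i2MOEr' + dna[2] + '/*'
# Middle bases are left unmodified and separated by '*'
dnaMiddle = '*'.join(dna[3:-3])
# Last three bases: /i2MOEr_/ for the first two, /32MOEr_/ for the last
dnaLast3Chars = '*/i2MOEr' + dna[-3] + '/*/i2MOEr' + dna[-2] + '/*/32MOEr' + dna[-1] + '/'

dnaTransformed = dnaFirst3Chars + dnaMiddle + dnaLast3Chars

print(dnaTransformed)

Output:

/52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/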

UPDATE:

For simplicity, you can wrap the above code in a function like this:

def dna_transformation(dna):
    """ Takes a DNA string and returns the transformed DNA """

    # First three bases: /52MOEr_/ for the first, /i2MOEr_/ for the next two
    dnaFirst3Chars = '/52MOEr' + dna[0] + '/*/i2MOEr' + dna[1] + '/*/i2MOEr' + dna[2] + '/*'
    # Middle bases are left unmodified and separated by '*'
    dnaMiddle = '*'.join(dna[3:-3])
    # Last three bases: /i2MOEr_/ for the first two, /32MOEr_/ for the last
    dnaLast3Chars = '*/i2MOEr' + dna[-3] + '/*/i2MOEr' + dna[-2] + '/*/32MOEr' + dna[-1] + '/'

    return dnaFirst3Chars + dnaMiddle + dnaLast3Chars

print(dna_transformation("TGTACGTTGCTCCGAC")) # call the function

Output: /52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/
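
One edge case worth noting: when three strings are concatenated this way, a sequence of exactly six bases has an empty middle, so the separators on either side of it meet and produce `**`. The sketch below is one possible way around that. It assumes (this is not stated in the question) that sequences shorter than six bases should be rejected, and the name `dna_transformation_checked` is hypothetical.

```python
def dna_transformation_checked(dna):
    """Return the modified form of one DNA string.

    Assumption: at least 6 bases are required so the two ends do not overlap.
    """
    if len(dna) < 6:
        raise ValueError(f"sequence too short to modify both ends: {dna!r}")

    first = '/52MOEr' + dna[0] + '/*/i2MOEr' + dna[1] + '/*/i2MOEr' + dna[2] + '/'
    last = '/i2MOEr' + dna[-3] + '/*/i2MOEr' + dna[-2] + '/*/32MOEr' + dna[-1] + '/'

    # Collect every piece in one list and join once, so an empty middle
    # cannot leave two '*' separators next to each other.
    parts = [first] + list(dna[3:-3]) + [last]
    return '*'.join(parts)

print(dna_transformation_checked("TGTACGTTGCTCCGAC"))
print(dna_transformation_checked("TGTGAC"))
```

Joining a single list of pieces keeps the separator logic in one place, which is also the approach the other answer takes.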

codrelphi

Assuming there's a typo in your expected result and it should actually be /52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/, the code below will work:

# python3
def encode_sequence(seq):
    seq_front = seq[:3]
    seq_back = seq[-3:]
    seq_middle = seq[3:-3]
    front_ix = ["/52MOEr{}/", "/i2MOEr{}/", "/i2MOEr{}/"]
    back_ix = ["/i2MOEr{}/", "/i2MOEr{}/", "/32MOEr{}/"]
    encoded = []
    for base, index in zip(seq_front, front_ix):
        encoded.append(index.format(base))
    encoded.extend(seq_middle)
    for base, index in zip(seq_back, back_ix):
        encoded.append(index.format(base))
    return "*".join(encoded)

Read through the code and make sure you understand it. Essentially we're just slicing the original string and inserting the bases into the format you need. Each element of the final output is added to a list and joined by the * character at the end.
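
To make the individual steps concrete, here is a small sketch of what the slices and the format call produce for the example sequence. The variable names `tagged` and `pieces` are only for illustration.

```python
seq = "TGTACGTTGCTCCGAC"

front = seq[:3]      # first three bases
middle = seq[3:-3]   # everything between the two ends
back = seq[-3:]      # last three bases
print(front, middle, back)

# str.format replaces the {} placeholder with the base it is given.
tagged = "/52MOEr{}/".format(front[0])
print(tagged)

# Iterating over a string yields single characters, so extending a list
# with a string adds one base per element, ready to be joined with '*'.
pieces = [tagged]
pieces.extend(middle)
print("*".join(pieces))
```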

If you need to specify dynamically how many bases are taken from the front and back of the sequence, and which template each one uses, you can use this version. Note that the {} braces tell the str.format method where to insert the base.

def encode_sequence_2(seq, front_ix, back_ix):
    seq_front = seq[:len(front_ix)]
    seq_back = seq[-len(back_ix):]
    seq_middle = seq[len(front_ix):-len(back_ix)]
    encoded = []
    for base, index in zip(seq_front, front_ix):
        encoded.append(index.format(base))
    encoded.extend(seq_middle)
    for base, index in zip(seq_back, back_ix):
        encoded.append(index.format(base))
    return "*".join(encoded)

And here's the output:

> seq = "TGTACGTTGCTCCGAC"
> encode_sequence(seq)
/52MOErT/*/i2MOErG/*/i2MOErT/*A*C*G*T*T*G*C*T*C*C*/i2MOErG/*/i2MOErA/*/32MOErC/
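
The more general `encode_sequence_2` can be called as sketched below. The first call passes the same templates as the fixed version; the second, with one template per end, is purely illustrative and not something the question asks for. Both template lists are assumed to be non-empty, because `seq[-0:]` is the whole string rather than an empty slice.

```python
def encode_sequence_2(seq, front_ix, back_ix):
    # Assumption: both template lists are non-empty (seq[-0:] is the whole string).
    seq_front = seq[:len(front_ix)]
    seq_back = seq[-len(back_ix):]
    seq_middle = seq[len(front_ix):-len(back_ix)]
    encoded = []
    for base, index in zip(seq_front, front_ix):
        encoded.append(index.format(base))
    encoded.extend(seq_middle)
    for base, index in zip(seq_back, back_ix):
        encoded.append(index.format(base))
    return "*".join(encoded)

seq = "TGTACGTTGCTCCGAC"

# Same templates as the fixed-size version: three modified bases per end.
three_each = encode_sequence_2(
    seq,
    ["/52MOEr{}/", "/i2MOEr{}/", "/i2MOEr{}/"],
    ["/i2MOEr{}/", "/i2MOEr{}/", "/32MOEr{}/"],
)
print(three_each)

# Illustrative only: one modified base per end.
one_each = encode_sequence_2(seq, ["/52MOEr{}/"], ["/32MOEr{}/"])
print(one_each)
```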

If you have a list of sequences to encode you can iterate over the list and encode each:

encoded_list = []
for seq in dna_list:
    encoded_list.append(encode_sequence(seq))

Or with a list comprehension:

encoded_list = [encode_sequence(seq) for seq in dna_list]
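
Putting it together for the four sequences from the question, a minimal end-to-end sketch might look like this. The function is a condensed restatement of `encode_sequence`, and the variable name `raw` is an assumption. The key point is that the function is called once per sequence, not on the list itself.

```python
def encode_sequence(seq):
    front_ix = ["/52MOEr{}/", "/i2MOEr{}/", "/i2MOEr{}/"]
    back_ix = ["/i2MOEr{}/", "/i2MOEr{}/", "/32MOEr{}/"]
    encoded = [template.format(base) for base, template in zip(seq[:3], front_ix)]
    encoded.extend(seq[3:-3])
    encoded.extend(template.format(base) for base, template in zip(seq[-3:], back_ix))
    return "*".join(encoded)

# Whitespace-separated sequences, as in the question.
raw = "AAGTCTGGTTAACCAT AATACTAGGTAACTAC TGTACGTTGCTCCGTC TGTAGTTAGCTCCGTC"
dna_list = raw.split()

# One encoded string per input sequence.
encoded_list = [encode_sequence(seq) for seq in dna_list]

for seq, encoded in zip(dna_list, encoded_list):
    print(seq, "->", encoded)
```

Passing `dna_list` directly to `encode_sequence` would instead slice the list, so each whole sequence would be treated as if it were a single base.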
Evan
  • Thank you Evan. You are correct, I have fixed the typo, many apologies. I think I am able to understand how the code works. Is there a way to adapt this method for a list of sequences? Say, for example, I have a bunch of DNA sequences and want to apply the algorithm. ```>>> dna = "'AAGTCTGGTTAACCAT AATACTAGGTAACTAC TGTACGTTGCTCCGTC TGTAGTTAGCTCCGTC" >>> dna_list = dna.split() >>> encode_sequence(dna_list) "/52MOEr'AAGTCTGGTTAACCAT/*/i2MOErAATACTAGGTAACTAC/*/i2MOErTGTACGTTGCTCCGTC/*/i2MOErAATACTAGGTAACTAC/*/i2MOErTGTACGTTGCTCCGTC/*/32MOErTGTAGTTAGCTCCGTC/"``` That is the output I get. – ajs Nov 01 '19 at 19:20
  • You want a separate encoded string for each sequence, right? Try the example I added to the end of my answer. This will encode each sequence in your list and create a new list of the encoded sequences. Is that what you want? – Evan Nov 02 '19 at 21:47