0

How we could code the reverse complete of a DNA sequence from its code?

A DNA sequence can contain 4 different characters A, C, G, T; where A is the complement of T and C is the complement of G.

A reverse complement of A DNA sequence is the complement of a sequence but in an inverse way (we compute the complement of each character from right to left).

Example: the complement of (AA) is: TT, the complement of (AC) is GT and so on...

In general, using python we code a sequence by mapping each character to a number going from 0 to 3,

    {A:0, C:1, G:2, T:3}

then the coding of AA is: 0, the coding of AC is:

    AC = 0*4^0+1*4^1 = 4

the coding of GT is:

    GT = 2*4^0+3*4^1 = 14 

How could I transform the code of each sequence to its reverse complement in python without creating a dictionary? For the above example: convert 4 to 14? and 0 to 15 ...

Cecile
  • 93
  • 10
  • can you not just use bit masks? It sounds like it would probably be easier to just precompute and store them in a dictionary. – beoliver Aug 10 '17 at 21:03
  • does this help you clear the thing a little or give you some hint? https://stackoverflow.com/questions/1604464/twos-complement-in-python#9147327 and I agree with @beoliver, why not use a bit mask? – OLIVER.KOO Aug 10 '17 at 21:04
  • Are you copying snippets from a textbook or something? We haven't been given an example, and rounding an integer doesn't make sense. Do they mean rotation? If so they're being strangely specific as that would only work for two symbols, and it would make more sense to say you're swapping them than shifting either direction. – Yann Vernier Aug 10 '17 at 21:04
  • We cannot use a dictionary, because the number of characters will an input variable for the programme. Here, I gave a simple example with 2 characters but if we work with 10 or more, we will a problem with the computation complexity. – Cecile Aug 10 '17 at 21:07
  • 2
    I suggest you take a step back from specific operations and explain to us 1) what a "reverse complete" of a sequence is, and 2) what format the sequences are stored as. From this description I can only guess that pairs of bits might map to the 4 letters, but not which or why they relate to each other. – Yann Vernier Aug 10 '17 at 21:13
  • Also guessing this may have to do with complementing, but not at all with two's complement (wherein 0 maps to 0). – Yann Vernier Aug 10 '17 at 21:22
  • If a have a DNA sequence of some characters, its reverse complement will be the complement of each character but in an inverse way. To be clearer, given a sequence of 6 characters: ACCGTT, the complement of A is: T, and the complement of C is G; so the complement of ACCGTT is: AACGGT. In general, we create a hash table where A maps to 0, C to 1, G to 2 and T to 3. For this, if we have: AA (0), its complement will be TT (15), if we have AC (4), its complement will be GT (14), and so on ... – Cecile Aug 10 '17 at 21:24
  • We read a sequence from left to right, but when computing its reverse, we should give the complement of each character from right to left. Hope the idea is clear now ... – Cecile Aug 10 '17 at 21:31
  • 1
    Possible duplicate of [Reverse complement of DNA strand using Python](https://stackoverflow.com/questions/25188968/reverse-complement-of-dna-strand-using-python) – beoliver Aug 11 '17 at 06:16

2 Answers2

0

the reverse of a list in python

>>> xs = [1,2,3]
>>> reversed(xs)
<listreverseiterator object at 0x10089c9d0>
>>> list(reversed(xs))
[3, 2, 1]
>>>

def complement(x):
   return ~x & 15 # as 15 == int('1111', 2) 

the 15 is a bitmask. It represents the binary 1111. We then use the binary and operator.

>>> "{0:b}".format(complement(int('1111',2)))
'0'
>>> "{0:b}".format(complement(int('0001',2)))
'1110'
>>> "{0:b}".format(complement(int('1001',2)))
'110'

>>> xs = [int('1111',2), int('1001',2), int('0110',2), int('1011',2)]
>>> map(complement, xs)
[0, 6, 9, 4]
>>> list(reversed(map(complement, xs)))
[4, 9, 6, 0]

Basing your example where

given a sequence of 6 characters: ACCGTT, the complement of A is: T, and the complement of C is G; so the reverse complement of ACCGTT is: AACGGT.

assume that you have c complemnt function complement and a reverse function reverse.

we have reverse(ACCGTT) = TTGCCA and complement(ACCGTT) = TGGCAA . Reversing a list after calling a function on each element is the same as calling a function on each element on a list.

complement(reverse(ACCGTT)) = reverse(complement(ACCGTT))

So the other part of the question is that you want to map

{A:0, C:1, G:2, T:3}
A -> T | 0 -> 3
T -> A | 3 -> 0
C -> G | 1 -> 2
G -> C | 2 -> 1

which in binary would be

a = int('00', 2) # 0
c = int('01', 2) # 1
g = int('10', 2) # 2
t = int('11', 2) # 3

def complement(x):
    return ~x & 3 # this 3 is the same as int('11', 2)

def reverse_complement(list_of_ints):
    return list(reversed(map(complement, list_of_ints)))
beoliver
  • 5,579
  • 5
  • 36
  • 72
  • In your example, you just give the reverse of each number. For my question, I need the following conversion, in case that we have only 2 characters: 0--15, 1--11, 2 -- 7, 3 --3, 4 -- 14, and so on.., if we have 3 characters: 0 -- 63, 16 -- 62, 48 -- 60 ,.. – Cecile Aug 10 '17 at 21:35
  • Cecile, how is the function supposed to know how many characters are in a number? You have 0 mapping to an arbitrarily large set of values here. – Yann Vernier Aug 10 '17 at 21:47
  • Yann Vernier, the number of characters is an input variable for the program. For example, if the input number is 2, we could have from AA (0) to TT(15). If we have 3 as input, we can get from AAA (0) to TTT (63). But how many characters in a number ? This was my question, and I tried to look for the binary code of each number in the hoping of finding some relations between a number and its reverse complement ... – Cecile Aug 10 '17 at 22:01
  • 1
    @Cecile if AA = 0 and AAA = 0 then you can have an infinite number of A's = 0. Where are you getting this information from? – beoliver Aug 10 '17 at 22:03
  • @beoliver, an AA's=0, this coding is used in DNA and RNA sequences. (https://www.researchgate.net/publication/224570529_Numerical_representation_of_DNA_sequences) – Cecile Aug 10 '17 at 22:17
  • @beoliver, sorry !! the reverse complement of ACCGTT is: AACGGT. – Cecile Aug 10 '17 at 22:22
  • That paper lists at least 11 representations of individual symbols, and the one which uses the integers from 0 to 3 doesn't match the order you've written (it uses T=0). – Yann Vernier Aug 10 '17 at 22:28
  • @Cecile, it should still work. the reverse of a complement is just the same as the complement of a reverse. – beoliver Aug 10 '17 at 22:34
0

Your symbol set is too small for a hash map to actually be efficient. And mixing two's complement into your problem has just caused confusion.

symbols = 'ACGT'
complements = symbols[::-1]   # reverse order
import string
table = string.maketrans(symbols, complements)
sample = 'ACCGTT'
print(sample[::-1].translate(table))
# output: AACGGT

Converting to some bitpacked format would take less space but require a lot more special handling, as you'd need to track sizes separately, perform arbitrarily wide shifts and so on. Python can certainly do it, in particular with int() accepting many bases and creating arbitrary width results, but it's likely a counterproductive detour.

digits = string.digits[:len(symbols)]
length = len(sample)
digitmap = string.maketrans(symbols, digits)
number = int(sample.translate(digitmap), len(digits))

def reversemapnumber(function=id, number=0, radix=0b100, length=0):
    result = 0
    for i in range(length):
        number,digit = divmod(number, radix)
        result = result*radix + function(digit)
    return result
revcomplemented = reversemapnumber(function=lambda x: 3-x,
        number=number, length=length)
# binary form
print('{:0{}b}'.format(revcomplemented, length*2))
# back to text form
print(''.join(symbols[(revcomplemented>>i)&0b11]
    for i in range(2*length-2, -2, -2)))

In that jumble of code I've used division rather than shifts to be somewhat more generic (supporting radix not a power of two), but the printing examples rely on the width exactly. In the end it's just tricky and unclear.

Yann Vernier
  • 15,414
  • 2
  • 28
  • 26