Reverse complement of DNA strand using Python

Question

I have a DNA sequence and would like to get reverse complement of it using Python. It is in one of the columns of a CSV file and I'd like to write the reverse complement to another column in the same file. The tricky part is, there are a few cells with something other than A, T, G and C. I was able to get reverse complement with this piece of code:

def complement(seq):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'} 
    bases = list(seq) 
    bases = [complement[base] for base in bases] 
    return ''.join(bases)
    def reverse_complement(s):
        return complement(s[::-1])

    print "Reverse Complement:"
    print(reverse_complement("TCGGGCCC"))

However, when I try to find the item which is not present in the complement dictionary, using the code below, I just get the complement of the last base. It doesn't iterate. I'd like to know how I can fix it.

def complement(seq):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'} 
    bases = list(seq) 
    for element in bases:
        if element not in complement:
            print element  
        letters = [complement[base] for base in element] 
        return ''.join(letters)
def reverse_complement(seq):
    return complement(seq[::-1])

print "Reverse Complement:"
print(reverse_complement("TCGGGCCCCX"))

What do you want to be the complement of items not present in the dictionary? The original item itself? — aa333, Aug 07 '14 at 17:55
@aa333 There are some values like ins and dup, I'd like to print them as it is. I've tried using Bio.Seq but it converts 'ins' to 'sni' while reverse complementing. — user3783999, Aug 07 '14 at 17:57
first replace `ins` and other "bases" with single letter substitutes, reverse, then put them back — Gabriel, Aug 07 '14 at 18:02
@Gabriel is there a Pythonic way to do that, since I'm afraid I can't do it manually. Thank you — user3783999, Aug 07 '14 at 18:08

score 39 · Answer 1 · edited Aug 22 '20 at 18:03

The other answers are perfectly fine, but if you plan to deal with real DNA sequences I suggest using Biopython. What if you encounter a character like "-", "*" or indefinitions? What if you want to do further manipulations of your sequences? Do you want to create a parser for each file format out there?

The code you ask for is as easy as:

from Bio.Seq import Seq

seq = Seq("TCGGGCCC")

print seq.reverse_complement()
# GGGCCCGA

Now if you want to do another transformations:

print seq.complement()
print seq.transcribe()
print seq.translate()

Outputs

AGCCCGGG
UCGGGCCC
SG

And if you run into strange chars, no need to keep adding code to your program. Biopython deals with it:

seq = Seq("TCGGGCCCX")
print seq.reverse_complement()
# XGGGCCCGA

Jason S · Answer 2 · 2014-08-07T18:23:50.517

20

In general, a generator expression is simpler than the original code and avoids creating extra list objects. If there can be multiple-character insertions go with the other answers.

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
seq = "TCGGGCCC"
reverse_complement = "".join(complement.get(base, base) for base in reversed(seq))

edited Aug 07 '14 at 18:23

answered Aug 07 '14 at 18:04

Jason S

13,538
2
37
42

why do you need `(base, base)` two times here? – Igor Barinov Aug 16 '15 at 04:30
7

@IgorBarinov: The `get` function has a default return type of `None` for when the key is not available. Setting `base` as the default return type avoids a `KeyError` and just inserts the value for `base`. The effect is that unusual characters in the sequence are now handled (correctly). – Steve Feb 28 '17 at 08:09

score 16 · Answer 3 · answered Oct 28 '14 at 18:35

16

import string
old_chars = "ACGT"
replace_chars = "TGCA"
tab = string.maketrans(old_chars,replace_chars)
print "AAAACCCGGT".translate(tab)[::-1]

that will give you the reverse compliment = ACCGGGTTTT

answered Oct 28 '14 at 18:35

Nathan M

171
1
6

4

Best answer for me. Note it needs to be adapted for python3: replace `string` by `str` (no need to import) and obviously add parentheses to `print`. – Jean Paul Nov 26 '18 at 16:39
answered Oct 28 '14 at 18:35 - At the time that was correct ... yes, you are right. Great observation. – Nathan M Nov 30 '18 at 21:44

Gabriel · Accepted Answer · 2014-08-07T18:25:28.583

5

The get method of a dictionary allows you to specify a default value if the key is not in the dictionary. As a preconditioning step I would map all your non 'ATGC' bases to single letters (or punctuation or numbers or anything that wont show up in your sequence), then reverse the sequence, then replace the single letter alternates with their originals. Alternatively, you could reverse it first and then search and replace things like sni with ins.

alt_map = {'ins':'0'}
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'} 

def reverse_complement(seq):    
    for k,v in alt_map.iteritems():
        seq = seq.replace(k,v)
    bases = list(seq) 
    bases = reversed([complement.get(base,base) for base in bases])
    bases = ''.join(bases)
    for k,v in alt_map.iteritems():
        bases = bases.replace(v,k)
    return bases

>>> seq = "TCGGinsGCCC"
>>> print "Reverse Complement:"
>>> print(reverse_complement(seq))
GGGCinsCCGA

edited Aug 07 '14 at 18:25

answered Aug 07 '14 at 17:57

Gabriel

10,524
1
23
28

thank you for the answer. I cannot hard code the element 'alt' because it can be anything. There's no definite structure for it. – user3783999 Aug 07 '14 at 18:00
updated the answer to work w/o knowing what the the other bases will be – Gabriel Aug 07 '14 at 18:03
2

Just use `complement.get(base,base)`. – aa333 Aug 07 '14 at 18:03
If you want to publish this somewhere, beware [Ambiguity codes](http://reverse-complement.com/ambiguity.html) – Audrius Meškauskas Aug 18 '16 at 19:20
A fastest and clearer way to reverse a string: `your_string[::-1]` – Mastodon Apr 18 '17 at 22:16

score 4 · Answer 5 · answered Feb 05 '20 at 06:49

4

The fastest one liner for reverse complement is the following:

def rev_compl(st):
    nn = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    return "".join(nn[n] for n in reversed(st))

answered Feb 05 '20 at 06:49

alphahmed

97
4

Best answer, for pure Python – Can H. Tartanoglu May 12 '21 at 16:49

score 2 · Answer 6 · answered Jun 21 '17 at 18:38

def ReverseComplement(Pattern):
    revcomp = []
    x = len(Pattern)
    for i in Pattern:
        x = x - 1
        revcomp.append(Pattern[x])
    return ''.join(revcomp)

# this if for the compliment 

def compliment(Nucleotide):
    comp = []
    for i in Nucleotide:
        if i == "T":
            comp.append("A")
        if i == "A":
            comp.append("T")
        if i == "G":
            comp.append("C")
        if i == "C":
            comp.append("G")

    return ''.join(comp)

score 2 · Answer 7 · edited Sep 23 '19 at 11:51

2

Give a try to below code,

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
seq = "TCGGGCCC"
reverse_complement = "".join(complement.get(base, base) for base in reversed(seq))

edited Sep 23 '19 at 11:51

Kiran Maniya

8,453
9
58
81

answered Sep 23 '19 at 10:10

Akansha Rana

21
2

Add code formatting to your post. Also, consider adding any explanation to your solution. – Pochmurnik Sep 23 '19 at 10:33

msim · Answer 8 · 2022-02-22T09:48:59.947

1

Considering also degenerate bases:

def rev_compl(seq):
    BASES ='NRWSMBDACGTHVKSWY'
    return ''.join([BASES[-j] for j in [BASES.find(i) for i in seq][::-1]])

edited Feb 22 '22 at 09:48

answered Jun 06 '21 at 08:30

msim

11
3

Please note that `str.find()` is very dangerous in this case, because any base not in `BASES` will return `j` as `-1` (and `BASES[-j]` as `R`). Please use `str.index()` instead. – Feng Tian Aug 24 '22 at 09:24

score 0 · Answer 9 · answered Dec 22 '21 at 00:17

0

This may be the quickest way to complete a reverse compliment:

def complement(seq):
    complementary = { 'A':'T', 'T':'A', 'G':'C','C':'G' }
    return ''.join(reversed([complementary[i] for i in seq]))

answered Dec 22 '21 at 00:17

Get_Richie

41
6

asgaines · Answer 10 · 2023-02-10T16:42:46.493

0

Using the timeit module for speed profiling, this is the fastest algorithm I came up with with my coworkers for sequences < 200 nucs:

sequence \
    .replace('A', '*') \  # Temporary symbol
    .replace('T', 'A') \
    .replace('*', 'T') \
    .replace('C', '&') \  # Temporary symbol
    .replace('G', 'C') \
    .replace('&', 'G')[::-1]

edited Feb 10 '23 at 16:42

answered Feb 10 '23 at 16:30

asgaines

197
1
3
12

Reverse complement of DNA strand using Python

10 Answers10

Linked