0

The complement to a typical string of DNA bases can be found by reversing the string then replacing each base with it's complement (A->T, T->A, C->G, G->C). For example, the complement of CTTAACCAGCGGACACGGGCTTGGC is GCCAAGCCCGTGTCCGCTGGTTAAG.

I wrote a couple algorithms to do this in javascript and am wondering if they are equivalent, i.e. equivalent time complexity for example, but also equivalent for any other considerations in this context of finding complements in a browser and updating an HTML document with the output. I am trying to understand Regexes and how javacript gets interpreted, compiled and executed.

First the script validates the input string s with this Regex: /[^CGAT]/. If the input string has a character besides a C, G, A, or T, the script stops and alerts the user why. Next, the script generates the reverse string with var reverseSeq=s.split('').reverse().join('') (which I found here).

Then either of these javascripts converts the reversed string to the original's complement.

1.)

var complementSeq='';
for (i in reverseSeq){complementSeq=complementSeq+COMPLEMENT_BASES[reverseSeq[i]]};

2.)

complementSeq=reverseSeq.replace(/(A)|(C)|(T)|(G)/g, function (match) {return COMPLEMENT_BASES[match]})

2) depends on this object: COMPLEMENT_BASES={'A':'T','T':'A','G':'C','C':'G'};.

Are these methods equally efficient, and basically equivalent otherwise? Is there any other approach I'm missing that might be more efficient?

Jesse W. Collins
  • 139
  • 1
  • 1
  • 12
  • 1
    The fastest way might be a direct replace all regex function. The engine doesn't return until all the replacements are made. Some engines offer in-line replacement. Not sure about JS though. –  Oct 25 '15 at 23:26

3 Answers3

3

I think, you should use the already existent array and map the complement value before joining it.

The overhead of replace and the use of regex is more than you really need here.

function complement(a) {
    return { A: 'T', T: 'A', G: 'C', C: 'G' }[a];
}    

var sequence = 'CTTAACCAGCGGACACGGGCTTGGC',
    complementSeq = sequence.split('').reverse().map(complement).join('');
document.write(complementSeq);
Nina Scholz
  • 376,160
  • 25
  • 347
  • 392
2

Well you could speed it a little with a simplified RegExp as you don't really need capture groups

var COMPLEMENT_BASES = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'},
    re = /[ATCG]/g;

var complementSeq = reverseSeq.replace(re, function ($0) {
    return COMPLEMENT_BASES[$0]
});

You could also halve+ the number of iterations by using more predefined. This is a linear speed improvement though. i.e.

var COMPLEMENT_BASES = {
     'A':  'T',  'T':  'A',  'G':  'C',  'C':  'G',
    'AA': 'TT', 'AT': 'TA', 'AC': 'TG', 'AG': 'TC',
    'TA': 'AT', 'TT': 'AA', 'TC': 'AG', 'TG': 'AC',
    'CA': 'GT', 'CT': 'GA', 'CC': 'GG', 'CG': 'GC',
    'GA': 'CT', 'GT': 'CA', 'GC': 'CG', 'GG': 'CC'
};

var re = /[ATCG]{1,2}/g;

If you will be generating a huge amount of this data, consider if a String of ATCGs is really the best option here. For example, is the following using Integers better?

__|_dec_|_bin_    <-- XOR -->    _bin_|_dec_|__
A |   0 |  00       (3) 11         11 |   3 | T
T |   3 |  11           11         00 |   0 | A
C |   1 |  01           11         10 |   2 | G
G |   2 |  10           11         01 |   1 | C

If you keep track of your sequence length you can now do this transformation in 32-bit XORs, so each XOR transforms 16 bases. The memory overhead is also reduced by 4.

In JavaScript, the XOR operator is ^, i.e. (0 ^ 3) === 3, (1 ^ 3) === 2; // true

Paul S.
  • 64,864
  • 9
  • 122
  • 138
1

The big-O efficiency of your algorithms is the same. All three of these algorithms are O(n) complexity (where n is the length of the input). I personally can't think of a way it could be any better than that, so it all comes down to profiling the code to see which is faster. However some things to consider:

  1. Regular expressions have a high initial compilation cost, so if this going to be called a lot, you might want to compile your regexes.
  2. Function calls can also be expensive in this context (they have to establish a stack and so on), so you may want to stay away from the regex replace because of this.
  3. For pure code legibility, the first method wins IMO (highly subjective I understand).

Another thing to consider might be striding over two or three letters at a time. Again, this would have to be profiled, but you could construct a lookup table based on two or three letters and their complements at a time. For something like this, the loop operation and string concatenation operations might be a pretty big overhead if the string length is large enough.

syazdani
  • 4,760
  • 1
  • 26
  • 35