You are interpreting a string of 4 different 'digits' as a number, which makes this base-4 notation. If you had a string of actual digits, in the range 0-3, you could have int() produce an integer really fast.
def seq_to_int(seq, _m=str.maketrans('ACGT', '0123')):
    return int(seq.translate(_m), 4)
The above function uses str.translate() to replace each of the 4 characters with a matching digit (I used the static str.maketrans() function to create the translation table). The resulting string of digits is then interpreted as an integer number in base 4.
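To make the two steps visible, here is a minimal sketch that splits the function into its translation and parsing stages for the short input 'ACGT'. The names table, digits and value are only illustrative and are not part of the function above.

```python
# Translation table mapping each nucleotide letter to a base-4 digit.
table = str.maketrans('ACGT', '0123')

# Step 1: replace letters with digits: 'A'->'0', 'C'->'1', 'G'->'2', 'T'->'3'.
digits = 'ACGT'.translate(table)

# Step 2: parse the digit string as base 4: 0*64 + 1*16 + 2*4 + 3*1.
value = int(digits, 4)

print(digits)  # 0123
print(value)   # 27
```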
Note that this produces an integer object, not a binary string of zero and one characters:
>>> seq_to_int('TGTGAGAAGCACCATAAAAGGCGTTGTG')
67026852874722286
>>> format(seq_to_int('TGTGAGAAGCACCATAAAAGGCGTTGTG'), '016x')
'00ee20914c029bee'
>>> format(seq_to_int('TGTGAGAAGCACCATAAAAGGCGTTGTG'), '064b')
'0000000011101110001000001001000101001100000000101001101111101110'
No padding is needed here; as long as your input sequence is 32 letters or fewer, the resulting integer will fit in an unsigned 8-byte integer representation. In the above output examples, I used the format() function to format that integer value as a hexadecimal and binary string, respectively, and zero-padded those representations to the correct number of digits for a 64-bit number.
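The 32-letter limit follows from each letter taking 2 bits. The sketch below checks that limit and, as one possible way to get an actual 8-byte representation, uses int.to_bytes() with big-endian byte order; the byte order is my assumption here, so pick whichever your storage format needs.

```python
def seq_to_int(seq, _m=str.maketrans('ACGT', '0123')):
    return int(seq.translate(_m), 4)

# Each letter contributes 2 bits, so 32 letters fill exactly 64 bits.
largest = seq_to_int('T' * 32)
print(largest == 2 ** 64 - 1)  # True

# Pack a 28-letter sequence into 8 bytes (big-endian is an assumption).
packed = seq_to_int('TGTGAGAAGCACCATAAAAGGCGTTGTG').to_bytes(8, 'big')
print(packed.hex())  # 00ee20914c029bee

# A 33-letter sequence needs 66 bits and no longer fits in 8 bytes.
try:
    seq_to_int('T' * 33).to_bytes(8, 'big')
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True
```

Keep in mind that 'A' maps to the digit 0, so leading 'A' letters do not change the integer; if you need to recover the sequence later, you would have to store its length separately.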
To measure whether this is faster, let's take 1 million randomly produced test strings (each 28 characters long):
>>> from random import choice
>>> testvalues = [''.join([choice('ATCG') for _ in range(28)]) for _ in range(10 ** 6)]
The above function completes 1 million conversions in under 3/4 of a second on my MacBook Pro with a 2.9 GHz Intel Core i7, on Python 3.6.5:
>>> from timeit import timeit
>>> timeit('seq_to_int(next(tviter))', 'from __main__ import testvalues, seq_to_int; tviter=iter(testvalues)')
0.7316284350017668
So that's 0.73 microseconds per call.
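The seconds-to-microseconds shortcut works because timeit() runs the statement 1,000,000 times by default, which also happens to match the number of test values. As a rough sketch, you can make that relationship explicit by passing number yourself; the sample size below is deliberately smaller so it runs quickly, and the lambda adds a little call overhead, so the figures will not match the ones above exactly.

```python
from random import choice
from timeit import timeit

def seq_to_int(seq, _m=str.maketrans('ACGT', '0123')):
    return int(seq.translate(_m), 4)

# A smaller sample than the 1 million used above, just to keep this quick.
testvalues = [''.join([choice('ATCG') for _ in range(28)]) for _ in range(10 ** 4)]
tviter = iter(testvalues)

# Run exactly one call per test value, so the iterator is never exhausted early.
total = timeit(lambda: seq_to_int(next(tviter)), number=len(testvalues))

# Total seconds divided by the call count, converted to microseconds.
per_call_us = total / len(testvalues) * 1e6
print(f'{per_call_us:.2f} microseconds per call')
```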
(Previously, I advocated a pre-computation version, but after some experimentation I hit on the base-4 idea.)
To compare this to the other methods posted here so far, some need to be adjusted to produce integers too, and be wrapped into functions:
def seq_to_int_alexhall_a(seq, mapping={'A': b'00', 'C': b'01', 'G': b'10', 'T': b'11'}):
    return int(b''.join(map(mapping.__getitem__, seq)), 2)
def seq_to_int_alexhall_b(seq, mapping={'A': b'00', 'C': b'01', 'G': b'10', 'T': b'11'}):
    return int(b''.join([mapping[c] for c in seq]), 2)
def seq_to_int_jonathan_may(seq, mapping={'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}):
    result = 0
    for char in seq:
        result = result << 2
        result = result | mapping[char]
    return result
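Before timing anything, it is worth confirming that the variants actually agree. The following is only a sanity-check sketch: seq_to_int_shift and seq_to_int_join are my own condensed stand-ins for the shift-based and bytes-joining approaches above, not the original posters' exact code.

```python
from random import choice

def seq_to_int(seq, _m=str.maketrans('ACGT', '0123')):
    return int(seq.translate(_m), 4)

def seq_to_int_shift(seq, mapping={'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}):
    # Shift the running result left by 2 bits, then OR in the next letter.
    result = 0
    for char in seq:
        result = (result << 2) | mapping[char]
    return result

def seq_to_int_join(seq, mapping={'A': b'00', 'C': b'01', 'G': b'10', 'T': b'11'}):
    # Build a bytes string of '0'/'1' characters and parse it as base 2.
    return int(b''.join([mapping[c] for c in seq]), 2)

# Compare all three on a batch of random 28-letter sequences.
samples = [''.join([choice('ACGT') for _ in range(28)]) for _ in range(1000)]
all_agree = all(
    seq_to_int(s) == seq_to_int_shift(s) == seq_to_int_join(s)
    for s in samples
)
print(all_agree)  # True
```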
And then we can compare these:
>>> testfunctions = {
...     'Alex Hall (A)': seq_to_int_alexhall_a,
...     'Alex Hall (B)': seq_to_int_alexhall_b,
...     'Jonathan May': seq_to_int_jonathan_may,
...     # base_decode as defined in https://stackoverflow.com/a/50239330
...     'martineau': base_decode,
...     'Martijn Pieters': seq_to_int,
... }
>>> setup = """\
... from __main__ import testvalues, {} as testfunction
... tviter = iter(testvalues)
... """
>>> for name, f in testfunctions.items():
...     res = timeit('testfunction(next(tviter))', setup.format(f.__name__))
...     print(f'{name:>15}: {res:8.5f}')
...
  Alex Hall (A):  2.17879
  Alex Hall (B):  2.40771
   Jonathan May:  3.30303
      martineau: 16.60615
Martijn Pieters:  0.73452
The base-4 approach I propose easily wins this comparison, running about three times faster than the next-fastest alternative.