Convert String to 64bit integer mapping characters to custom two-bit values mapping

Question

I am trying to map a string of characters (A, T, C, G) into a 64 bit integer where each letter is represented as two bits using this mapping:

mapping = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

The "sequence" string will not be longer than 28 characters, and I intend for the integer to be zero-padded at the beginning to make it 64 bits. Currently, I use the below function, but it is incredibly slow. I then convert the output using by calling:

int(result, 2)

This currently works, but I'd like to make this function incredibly fast. I don't know C++ well so it's hard for me to port to that. I am trying Cython now but I am unfamiliar with that as well. Any help making this more efficient in Python (or even C++ or Cython equivalent) would be greatly appreciated.

Below is my code, again which I call int() on afterwards.

def seq_to_binary(seq):
    values = [mapping[c] for c in seq]
    BITWIDTH = 2
    return "".join(map(lambda x: bin(x)[2:].zfill(BITWIDTH), values)).encode();

At typical sequence input would be something like: 'TGTGAGAAGCACCATAAAAGGCGTTGTG'

Martijn Pieters · Answer 1 · 2018-05-08T18:04:57.270

You are interpreting a string of 4 different 'digits' as a number, so a base 4 notation. If you had a string of actual digits, in the range 0-3, you could have int() produce an integer really fast.

def seq_to_int(seq, _m=str.maketrans('ACGT', '0123')):
    return int(seq.translate(_m), 4)

The above function uses str.translate() to replace each of the 4 characters with a matching digit (I used the static str.maketrans() function to create the translation table). The resulting string of digits is then interpreted as an integer number in base 4.

Note that this produces an integer object, not binary string of zero and one characters:

>>> seq_to_int('TGTGAGAAGCACCATAAAAGGCGTTGTG')
67026852874722286
>>> format(seq_to_int('TGTGAGAAGCACCATAAAAGGCGTTGTG'), '016x')
'00ee20914c029bee'
>>> format(seq_to_int('TGTGAGAAGCACCATAAAAGGCGTTGTG'), '064b')
'0000000011101110001000001001000101001100000000101001101111101110'

No padding is needed here; as long as your input sequence is 32 letters or less, the resulting integer will fit in an unsigned 8-byte integer representation. In the above output examples, I used the format() string to format that integer value as a hexadecimal and binary string, respectively, and zero-padded those representations to the correct number of digits for a 64-bit number.

To measure if this is faster, lets take 1 million randomly produced test strings (each 28 characters long):

>>> from random import choice
>>> testvalues = [''.join([choice('ATCG') for _ in range(28)]) for _ in range(10 ** 6)]

The above function can produce 1 million conversions in under 3/4 of a second on my Macbook Pro with 2.9 GHz Intel Core i7, on Python 3.6.5:

>>> from timeit import timeit
>>> timeit('seq_to_int(next(tviter))', 'from __main__ import testvalues, seq_to_int; tviter=iter(testvalues)')
0.7316284350017668

So that's 0.73 microseconds per call.

(previously, I advocated a pre-computation version, but after experimentation I struck on the base-4 idea).

To compare this to the other methods posted here so far, some need to be adjusted to produce integers too, and be wrapped into functions:

def seq_to_int_alexhall_a(seq, mapping={'A': b'00', 'C': b'01', 'G': b'10', 'T': b'11'}):
    return int(b''.join(map(mapping.__getitem__, seq)), 2)

def seq_to_int_alexhall_b(seq, mapping={'A': b'00', 'C': b'01', 'G': b'10', 'T': b'11'}):
    return int(b''.join([mapping[c] for c in seq]), 2)

def seq_to_int_jonathan_may(seq, mapping={'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}):
    result = 0
    for char in seq:
        result = result << 2
        result = result | mapping[char]
    return result

And then we can compare these:

>>> testfunctions = {
...     'Alex Hall (A)': seq_to_int_alexhall_a,
...     'Alex Hall (B)': seq_to_int_alexhall_b,
...     'Jonathan May': seq_to_int_jonathan_may,
...     # base_decode as defined in https://stackoverflow.com/a/50239330
...     'martineau': base_decode,
...     'Martijn Pieters': seq_to_int,
... }
>>> setup = """\
... from __main__ import testvalues, {} as testfunction
... tviter = iter(testvalues)
... """
>>> for name, f in testfunctions.items():
...     res = timeit('testfunction(next(tviter))', setup.format(f.__name__))
...     print(f'{name:>15}: {res:8.5f}')
...
  Alex Hall (A):  2.17879
  Alex Hall (B):  2.40771
   Jonathan May:  3.30303
      martineau: 16.60615
Martijn Pieters:  0.73452

The base-4 approach I propose easily wins this comparison.

score 1 · Answer 2 · answered May 08 '18 at 20:57

My clumsy straight forward try in Cython, which is twice as fast as the best solution (@MartijnPieters's) so far:

%%cython

ctypedef unsigned long long ull

cdef ull to_int(unsigned char *data, int n):
    cdef ull res=0
    cdef int i
    cdef unsigned char ch
    for i in range(n):
        res<<=2
        ch=data[i]
        if ch==67: #C
            res+=1
        if ch==71: #G
            res+=2
        if ch==84: #T
            res+=3
    return res

cpdef str_to_int_ead(str as_str):
    s=as_str.encode('ascii')
    return to_int(s, len(s))

Compared to current @MartijnPieters's solution, it is twice as fast on my machine:

>>> [str_to_int_ead(x) for x in testvalues] == [seq_to_int(x) for x in testvalues]
True

>>> tviter=iter(testvalues)
>>> %timeit -n1000000 -r1 seq_to_int(next(tviter))
795 ns ± 0 ns per loop (mean ± std. dev. of 1 run, 1000000 loops each)

>>> tviter=iter(testvalues)
>>> %timeit -n1000000 -r1 str_to_int_ead(next(tviter))
363 ns ± 0 ns per loop (mean ± std. dev. of 1 run, 1000000 loops each)

That makes 0.795 seconds vs 0.363 seconds for the whole run (so it can be compared with timings measured by @MartijnPieters).

One coould ask, how many overhead can be saved, if the conversion unicode <-> ascii were not needed?

%%cython
....
cpdef bytes_to_int_ead(bytes as_bytes):
    return to_int(as_bytes, len(as_bytes))


>>> testbytes=[bytes(x.encode('ascii')) for x in testvalues]
>>> tviter=iter(testbytes)
>>> %timeit -n1000000 -r1 bytes_to_int_ead(next(tviter))
327 ns ± 0 ns per loop (mean ± std. dev. of 1 run, 1000000 loops each)

Only 10% faster - this is somewhat surprising...

However, we should not forget we also measuring the overhead of "nexting" an iterator, without we get:

>>> v=testvalues[0]
>>> %timeit str_to_int_ead(v)
>>> 139 ns ± 0.628 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


>>> v=testbytes[0]
>>> %timeit bytes_to_int_ead(v)
97.2 ns ± 1.03 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Now there is actually about 40% speed-up now!

Another interesting conclusion: There are also about 250ns (or 70%) overhead when tested with iterators. Without this overhead, cython beats @MartijnPieters's 140ns vs 550ns, i.e. by almost by factor 4.

Listing function the cython have been compared to (current state of @MartijnPieters's answer):

def seq_to_int(seq, _m=str.maketrans('ACGT', '0123')):
    return int(seq.translate(_m), 4)

test data:

from random import choice
testvalues = [''.join([choice('ATCG') for _ in range(28)]) for _ in range(10 ** 6)]

Alex Hall · Answer 3 · 2018-05-08T15:58:38.703

seq = 'TGTGAGAAGCACCATAAAAGGCGTTGTG'

mapping = {'A': b'00', 'C': b'01', 'G': b'10', 'T': b'11'}

result = b''.join(map(mapping.__getitem__, seq)).zfill(64)

print(result)

Here is some timing code to compare options:

import timeit

setup = """
seq = 'TGTGAGAAGCACCATAAAAGGCGTTGTG'

mapping = {'A': b'00', 'C': b'01', 'G': b'10', 'T': b'11'}
"""

for stmt in [
    "b''.join(map(mapping.__getitem__, seq)).zfill(64)",
    "b''.join([mapping[c] for c in seq]).zfill(64)",
]:
    print(stmt)
    print(timeit.timeit(stmt, setup, number=10000000))

I find that the two options are roughly the same, but your results may vary.

score 0 · Answer 4 · answered May 08 '18 at 16:26

Use the bit shift operator and addition. You've got the right idea with using a dictionary to hold character codes:

mapping = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

Produce a 28 character string (kind of redundant to call it this, string will do) for this example:

chars = 'TGTGAGAAGCACCATAAAAGGCGTTGTG'

Define a result and set it to zero:

result = 0

Strings in Python are actually just an array of characters, and you can iterate through the string as you would any array. We'll use this, along with a nested series of bit operations to do what you require:

for char in chars:
  result = result << 2
  result = result | mapping[char]

This will produce bits of length 2*len(chars) which in this case is 56. To get the extra

To add an extra 8 bits of leading zeros, the integer representation of this is actually a QWORD (64 bits) and will automatically fill the 8 Most Significant Bits with zeros.

print(result)
>> 67026852874722286

If you wanted to get really fancy, you could use ctypes to speed up your code.

score 0 · Answer 5 · answered May 08 '18 at 17:36

One way to think about this problem is to realize that the essence of what it's doing is a conversion from a base 4 number to a base 10. This can be done a number of ways, but one I like is the actually very generic accepted answer to the question Base 62 conversion.

Below is a modified version of it the does a base 4 conversion by default:

def base_decode(astring, alphabet="ACGT"):
    """Decode a Base X encoded astring into the number

    Arguments:
    - `astring`: The encoded astring
    - `alphabet`: The alphabet to use for encoding
    """
    base = len(alphabet)
    strlen = len(astring)
    num = 0
    for idx, char in enumerate(astring):
        power = (strlen - (idx + 1))
        num += alphabet.index(char) * (base ** power)

    return num

seq = 'TGTGAGAAGCACCATAAAAGGCGTTGTG'
print('seq_to_binary:', seq_to_binary(seq))
print('base_decode:', format(base_decode(seq), 'b'))

Note that this actually returns an integer of whatever bit length is needed (integers are variable length in Python) to store the number given as a character string packed in to binary integer value. The added call to format() turns that value into a binary string so it can be printed and compared to the result of calling your seq_to_binary() function which returns a string, not a 64-bit integer mentioned in the title.

Base-4 conversion can be done much, much faster though; `int()` can do this is C for us, all that is needed is a string translation from letters to digits. — Martijn Pieters, May 08 '18 at 17:58
@Martijn: Realizing and pointing out that this was simply a base-4 conversion problem was the primary point of my answer, not the particular implementation of it presented—which I picked because it was handy, easy to understand, and easily adapted to do base-4 conversions. Mapping the characters of the string to "normal" digits and using the `int()` built-in is very clever and a faster implementation. Congratulations. — martineau, May 08 '18 at 18:42

Convert String to 64bit integer mapping characters to custom two-bit values mapping

5 Answers5