
Given a nucleotide sequence, I'm writing some Julia code to generate a sparse vector of (masked) kmer counts, and I would like it to run as fast as possible.

Here is my current implementation:

using Distributions
using SparseArrays

function kmer_profile(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]

    d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)

    kmer_dict = Dict{Int, Int32}(4^k=>0)  # seed with key 4^k so the resulting sparse vector has length 4^k

    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        j = 1
        for i in 1:length(mask)
            if mask[i]
                kmer_hash += d[seq[n+i-1]] * basis[j]
                j += 1
            end
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end

    return sparsevec(kmer_dict)
end

seq = join(sample(['A','C','G','T'], 1000000))

mask_str = "111111011111001111111111111110"

mask = BitArray([parse(Bool, string(m)) for m in split(mask_str, "")])
k = sum(mask)

@time kmer_profile(seq, k, mask)

This code runs in about 0.3 seconds on my M1 MacBook Pro, is there any way to make it run significantly faster?

The function kmer_profile uses a sliding window of size length(mask) to count the number of times each masked kmer appears in the nucleotide sequence. A mask is a binary sequence, and a masked kmer is a kmer with nucleotides dropped at positions at which the mask is zero. E.g. the kmer ACGT and mask 1001 will produce the masked kmer AT.

To produce the kmer hash, the function treats each masked kmer as a base-4 number and converts it to a 64-bit integer, which is used to index into the kmer vector.

The value of k is equal to the number of ones in the mask string, and is implicitly limited to 31 so that kmer hashes fit into a 64-bit integer type.
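To make the hashing concrete, here is a small worked example for the ACGT / 1001 case above (toy variable names, separate from the benchmark code above):

d = Dict('A'=>0, 'C'=>1, 'G'=>2, 'T'=>3)
toy_kmer = "ACGT"
toy_mask = BitArray([1, 0, 0, 1])
toy_k = sum(toy_mask)                          # 2 kept positions: 'A' and 'T'
toy_basis = [4^i for i in (toy_k - 1):-1:0]    # [4, 1]
toy_hash = 1 + sum(d[toy_kmer[i]] * toy_basis[j] for (j, i) in enumerate(findall(toy_mask)))
# toy_hash == 4, i.e. 1 + 0*4 + 3*1, matching what kmer_profile computes for this window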

Set
  • Why do you think it's possible to make it significantly faster? The main bottleneck is just the Dict() use; if somehow you can avoid growing the dictionary dynamically, it can be faster – jling Jan 28 '23 at 01:00
  • @jling It turns out this is possible, and `sparsevec(kmer_dict)` surprisingly takes a big part of the execution time compared to dictionary operations. – Jérôme Richard Jan 28 '23 at 01:37

2 Answers


There are several possible optimizations to make this code faster.

First of all, one can convert the Dict to an array, since array-based indexing is faster than dictionary-based indexing; this is possible here because the keys are ASCII characters.

Moreover, the extraction of the sequence codes can be done once instead of length(mask) times per character by pre-computing the codes and putting the result in a temporary array.

Additionally, the mask-based conditional and the loop-carried dependency make things slow. Indeed, the condition cannot be (easily) predicted by the processor, causing it to stall for several cycles. The loop-carried dependency makes things even worse since the processor can hardly execute other instructions during this stall. This problem can be solved by pre-computing the factors from both mask and basis, resulting in a faster branch-less loop.

Once the above optimizations are done, the biggest bottleneck is sparsevec. In fact, it was also taking nearly half the time of the initial implementation! Optimizing this step is difficult but not impossible. It is slow because of random accesses in the Julia implementation. One can speed it up by sorting the key-value pairs first: the execution becomes more cache-friendly, and it can also help the prediction unit of the processor. This is a complex topic; for more details about how this works, please read Why is processing a sorted array faster than processing an unsorted array?

Here is the final optimized code:

function kmer_profile_opt(seq, k, mask)
    basis = [4^i for i in (k - 1):-1:0]

    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3

    seq_codes = [d[Int8(e)] for e in seq]

    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j += 1
        end
    end

    kmer_dict = Dict{Int, Int32}(4^k=>0)

    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        j = 1
        for i in 1:length(mask)
            kmer_hash += seq_codes[n+i-1] * premult[i]
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end

    sorted_kmer_pairs = sort(collect(kmer_dict))
    sorted_kmer_keys = [e[1] for e in sorted_kmer_pairs]
    sorted_kmer_values = [e[2] for e in sorted_kmer_pairs]
    return sparsevec(sorted_kmer_keys, sorted_kmer_values)
end

This code is a bit more than twice as fast as the initial implementation on my machine. A significant fraction of the time is still spent in the sorting algorithm.
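As a quick sanity check (using the seq, k and mask from the question), the optimized result should match the original implementation element-wise:

@assert kmer_profile(seq, k, mask) == kmer_profile_opt(seq, k, mask)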

The code can still be optimized further. One way is to use a parallel sort algorithm. Another way is to replace the premult[i] multiplication by a shift, which is faster assuming premult[i] is modified so as to contain exponents (see the sketch below). With these changes I expect the code to be about 4 times faster than the original. The main bottleneck should then be the creation of the big dictionary, and improving that further is very hard (though still possible).
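The shift variant is not written out above; the following is only a rough sketch of how it could look (kmer_profile_opt_shift is a hypothetical name, and this version also restructures the inner loop so that it visits only the masked positions, which is a slight departure from the premult loop):

using SparseArrays

function kmer_profile_opt_shift(seq, k, mask)
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]

    # Offsets of the masked positions inside the window and, for each one,
    # a shift amount such that 1 << shifts[j] == 4^(k - j) == basis[j]
    offsets = findall(mask)
    shifts = [2 * (k - j) for j in 1:k]

    kmer_dict = Dict{Int, Int32}(4^k => 0)

    for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        @inbounds for j in 1:k
            kmer_hash += Int(seq_codes[n + offsets[j] - 1]) << shifts[j]
        end
        haskey(kmer_dict, kmer_hash) ? kmer_dict[kmer_hash] += 1 : kmer_dict[kmer_hash] = 1
    end

    sorted_kmer_pairs = sort(collect(kmer_dict))
    return sparsevec(first.(sorted_kmer_pairs), last.(sorted_kmer_pairs))
end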

Jérôme Richard
  • Thanks for the detailed write up! I decided to accept Dan's answer as it was the version I went with, even though it was based on your initial work! – Set Jan 28 '23 at 22:32

Inspired by Jérôme's answer, and squeezing some more by avoiding Dicts altogether:

function kmer_profile_opt3a(seq, k, mask)
    d = zeros(Int8, 128)
    d[Int64('A')] = 0
    d[Int64('C')] = 1
    d[Int64('G')] = 2
    d[Int64('T')] = 3
    seq_codes = [d[Int8(e)] for e in seq]
           
    basis = [4^i for i in (k-1):-1:0]
    j = 1
    premult = zeros(Int64, length(mask))
    for i in 1:length(mask)
        if mask[i]
            premult[i] = basis[j]
            j += 1
        end
    end

    kmer_vec = Vector{Int}(undef, length(seq)-length(mask)+1)
    @inbounds for n in 1:(length(seq) - length(mask) + 1)
        kmer_hash = 1
        for i in 1:length(mask)
            kmer_hash += seq_codes[n+i-1] * premult[i]
        end
        kmer_vec[n] = kmer_hash
    end
    sort!(kmer_vec)
    return sparsevec(kmer_vec, ones(length(kmer_vec)), 4^k, +)  # duplicate indices are combined with +, i.e. counted
end

This achieved another 2x over Jérôme's answer on my machine. The auto-combining feature of sparsevec makes the code a bit more compact.
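For readers unfamiliar with that feature, here is a tiny standalone illustration of how sparsevec combines duplicate indices with the supplied function (toy values, unrelated to the benchmark):

using SparseArrays
v = sparsevec([1, 3, 3, 5], ones(4), 6, +)   # index 3 appears twice
# v has length 6, with v[1] == v[5] == 1.0 and v[3] == 2.0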

Trying to slim the code further and avoid unnecessary allocations in the sparse-vector creation, the following can be used:

using SparseArrays, LinearAlgebra

# Build a SparseVector of length n from a sorted vector of nonzero indices nzs,
# counting duplicates (nzs is mutated and reused as the index vector).
function specialsparsevec(nzs, n)
    vals = Vector{Int}(undef, length(nzs))

    # Run-length encode the sorted indices: vals[j] receives the count of each
    # run of equal values, and the unique values are compacted in place into nzs.
    j, k, count, last = (1, 1, 0, nzs[1])
    while k <= length(nzs)
        if nzs[k] == last
            count += 1
        else
            vals[j], nzs[j] = (count, last)
            count, last = (1, nzs[k])
            j += 1
        end
        k += 1
    end
    vals[j], nzs[j] = (count, last)
    resize!(nzs, j)
    resize!(vals, j)
    return SparseVector(n, nzs, vals)
end
function kmer_profile_opt3(seq, k, mask)
    d = zeros(Int8, 128)
    foreach(((i,c),) -> d[Int(c)]=i-1, enumerate(collect("ACGT")))  # 'A'=>0, 'C'=>1, 'G'=>2, 'T'=>3
    seq_codes = getindex.(Ref(d), Int8.(collect(seq)))
         
    # Build the same premult array as in kmer_profile_opt with a fold: each masked
    # position gets the place value 4^(number of masked positions to its right), 0 elsewhere.
    premult = foldr(
      (i,(p,j))->(mask[i] && (p[i]=j ; j<<=2) ; (p,j)),
      1:length(mask); init=(zeros(Int64,length(mask)),1)) |> first
           
    kmer_vec = sort(
      [ dot(@view(seq_codes[n:n+length(mask)-1]),premult) + 1 for 
        n in 1:(length(seq)-length(mask)+1) 
      ])

    return specialsparsevec(kmer_vec, 4^k)
end

This last version gets another 10% speedup (but is a little cryptic):

julia> @btime kmer_profile_opt($seq, $k, $mask);
  367.584 ms (81 allocations: 134.71 MiB)  # other answer

julia> @btime kmer_profile_opt3a($seq, $k, $mask);
  140.882 ms (22 allocations: 54.36 MiB)   # 1st this answer

julia> @btime kmer_profile_opt3($seq, $k, $mask);
  127.016 ms (14 allocations: 27.66 MiB)   # 2nd this answer
Dan Getz