
I am trying to speed up Python programs using Rust, a language in which I am a total beginner. I wrote a function that counts the occurrences of each possible string of length n within a larger string. For instance, if the main string is "AAAAT" and n=3, the outcome would be a hashmap {"AAA":2,"AAT":1}. I use pyo3 to call the Rust function from Python. The code of the Rust function is:

use pyo3::prelude::*;
use std::collections::HashMap;

fn count_nmers(seq: &str, n: usize) -> PyResult<HashMap<&str, u64>> {
    let mut current_pos: usize = 0;
    let mut counts: HashMap<&str, u64> = HashMap::new();
    while current_pos + n <= seq.len() {
        //print!("{}\n", &seq[current_pos..current_pos+n]);
        match counts.get(&seq[current_pos..current_pos + n]) {
            Some(repeats) => counts.insert(&seq[current_pos..current_pos + n], repeats + 1),
            None => counts.insert(&seq[current_pos..current_pos + n], 1),
        };
        current_pos += 1;
    }
    //print!("{:?}", counts);
    Ok(counts)
}

When I use small values of n (n < 10), Rust is about an order of magnitude faster than Python, but as n increases the gap shrinks, with both functions reaching the same speed by n = 200 (see graph).

[Graph: times to count for different n-mer lengths (Python in black, Rust in red)]

I must be doing something wrong with the strings, but I can't find the mistake.


The Python code is:

def nmer_freq_table(sequence,nmer_length=6):
    nmer_dict=dict()
    for nmer in seq_win(sequence,window_size=nmer_length):
        if str(nmer) in nmer_dict.keys():
            nmer_dict[str(nmer)]=nmer_dict[str(nmer)]+1
        else:
            nmer_dict[str(nmer)]=1
    return nmer_dict

def seq_win(seq,window_size=2):
    length=len(seq)
    i=0
    while i+window_size <= length:
        yield seq[i:i+window_size]
        i+=1

[Graph: benchmark results]

  • Why do you expect Rust to be faster at that? And what are the Python codes? – Kelly Bundy Aug 18 '22 at 15:10
  • Rust is a compiled language. Did you turn optimizations on and make sure you're in release mode? Benchmarking is meaningless in debug mode – Silvio Mayolo Aug 18 '22 at 15:17
  • @KellyBundy Well, as much as I love Python, it is supposed to be slower. And for the small substrings it is indeed an order of magnitude slower. – Alberto Marin Sanguino Aug 18 '22 at 15:18
  • Yes, I compiled using cargo build --release. The difference from the debug (unoptimized) version is not that big, though. – Alberto Marin Sanguino Aug 18 '22 at 15:18
  • Two things: One, using the [`Entry` api](https://doc.rust-lang.org/std/collections/hash_map/struct.HashMap.html#method.entry) should cut down on memory accesses, and two, your Rust code could panic on arbitrary UTF-8 strings (see the byte-based sketch after these comments). And just because something is compiled doesn't make it magically faster. This is almost exclusively random memory accesses; I don't see why Rust should be significantly faster at that than Python. – isaactfa Aug 18 '22 at 15:21
  • @isaactfa So the access to the HashMap is basically the bottleneck here and there is not much that we can do to speed it up... – Alberto Marin Sanguino Aug 18 '22 at 15:23
  • But with Python, the actual work is still done in C (assuming you use CPython), and when you make n larger, the fraction of the time spent on interpreting Python shrinks. With sufficiently large n, you're essentially *only* measuring its C speed, not its Python speed. – Kelly Bundy Aug 18 '22 at 15:26
  • @AlbertoMarinSanguino That seems to be more or less the case here. Have you profiled your code and determined this function to be a bottleneck? This task should be parallelizable quite nicely, maybe that could speed it up. – isaactfa Aug 18 '22 at 15:30
  • What happens with n larger than 200? Does Python become *faster* or do both remain equally fast? – Kelly Bundy Aug 18 '22 at 15:32
  • @KellyBundy Added longer n values to the benchmark. It seems the speedup is minimized exactly for the n I need :-) – Alberto Marin Sanguino Aug 18 '22 at 15:51
  • The Python code could btw also be quite a bit better: [code](https://tio.run/##XY7BDoIwDEDv@4re2AIeDIkxJB6Mv@DNGIKjgyXQjVIOfj0iUWJ819fXNj6lDZQfI8@z49CDDV2HVnygEXwfAwtcwkSCrFSNDqhHLh3jUEr16FCPOExIFrNVdEiNtKeDKRQsMMrEBLW3oj9b9CrefMubhwJ8@tPftxkXGDx4Aq6oQb3o7aCBHfw0kMLerJ0xSkX2JPr/1@S8cE0yyI2Z5xc) – Kelly Bundy Aug 18 '22 at 15:53
  • @KellyBundy You are absolutely right. I just have a lot of functions that work on a sliding window and never thought of optimizing them until now. With your answers and a few others, I start to see when to optimize my python code and when to go for a compiled language. – Alberto Marin Sanguino Aug 18 '22 at 15:56
  • @isaactfa I am still at the "Hello world!" stage with Rust, parallelizing is why I want to learn Rust, but I am not there yet. – Alberto Marin Sanguino Aug 18 '22 at 15:58
  • @AlbertoMarinSanguino Then optimization is the last thing I'd worry about. – isaactfa Aug 18 '22 at 16:00
  • @isaactfa Sure, I was just surprised by the similar performance of both functions. Especially because, as Kelly said, the Python code was not written to be particularly fast. – Alberto Marin Sanguino Aug 18 '22 at 16:05
  • Also, look here: https://stackoverflow.com/questions/70551997/faster-hashmap-for-sequential-keys. It behaves differently depending on whether the keys are sequential or random. – Evgeniy Aug 18 '22 at 16:05
  • Python's performance increase over the n-mer length range of [350, 400] is interesting. – SargeATM Aug 18 '22 at 16:05
  • @SargeATM Looks like it happened to be unusually slow at 350 once, maybe the machine was busy with other stuff for a moment. That bump doesn't appear in either Python version in the updated plot. – Kelly Bundy Aug 18 '22 at 16:15
  • @SargeATM Yeah, I am only running things once for each plot. All functions seem to have more or less the same speed except for Kelly's Python version, which is consistently faster. – Alberto Marin Sanguino Aug 18 '22 at 16:26
  • Btw I only converted the Counter to dict in order to match the original. If you're ok with a Counter, leaving that conversion out saves about 15% at n=200 in my test. – Kelly Bundy Aug 18 '22 at 16:32
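
For reference on the UTF-8 panic mentioned in the comments above: slicing a &str as &seq[a..b] panics if an index falls inside a multi-byte character. A minimal byte-based sketch (not code from the thread) sidesteps this by counting byte windows instead, at the cost of the keys being byte slices rather than &str:

use std::collections::HashMap;

// Counting raw byte windows avoids the char-boundary panic of &seq[a..b].
// windows(n) yields every length-n window of the byte slice, yields nothing
// if n > seq.len(), and panics if n == 0.
fn count_nmers_bytes(seq: &str, n: usize) -> HashMap<&[u8], u64> {
    let mut counts: HashMap<&[u8], u64> = HashMap::new();
    for window in seq.as_bytes().windows(n) {
        *counts.entry(window).or_insert(0) += 1;
    }
    counts
}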

2 Answers


You are computing the hash function multiple times; this may matter for large n values. Try using the entry function instead of manual gets and inserts:

while current_pos + n <= seq.len() {
    // entry() hashes and looks up the key once; or_default() inserts 0 if
    // the key is absent and returns a mutable reference either way.
    let en = counts.entry(&seq[current_pos..current_pos+n]).or_default();
    *en += 1;
    current_pos += 1;
}

Complete code here
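
Since that link isn't reproduced here, a self-contained sketch along the same lines (not necessarily the linked code), keeping the PyResult signature from the question:

use pyo3::prelude::*; // for PyResult, as in the question
use std::collections::HashMap;

fn count_nmers(seq: &str, n: usize) -> PyResult<HashMap<&str, u64>> {
    let mut counts: HashMap<&str, u64> = HashMap::new();
    let mut current_pos = 0;
    while current_pos + n <= seq.len() {
        // One hash and one lookup per window instead of get() + insert().
        *counts.entry(&seq[current_pos..current_pos + n]).or_insert(0) += 1;
        current_pos += 1;
    }
    Ok(counts)
}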

Next, make sure you are running --release-compiled code, e.g. `cargo run --release`.

And one more thing to bear in mind is discussed here: Rust may use a non-optimal hash function for your case, and you can change it.
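
For example, here is a sketch of the same loop with a faster (but not DoS-resistant) hasher. It assumes the third-party rustc-hash crate is added to Cargo.toml; any other BuildHasher would slot in the same way:

use rustc_hash::FxHashMap; // assumed dependency: rustc-hash = "1"

// FxHashMap is std's HashMap with the Fx hasher plugged in, so the
// entry-based counting loop is unchanged.
fn count_nmers_fx(seq: &str, n: usize) -> FxHashMap<&str, u64> {
    let mut counts: FxHashMap<&str, u64> = FxHashMap::default();
    let mut pos = 0;
    while pos + n <= seq.len() {
        *counts.entry(&seq[pos..pos + n]).or_insert(0) += 1;
        pos += 1;
    }
    counts
}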

And finally, on large data most of the time is spent in HashMap/dict internals, which are compiled code rather than Python. So don't expect the speedup to scale well.

Evgeniy
  • Also Python caches string hash values which would explain its shallower slope on the performance graph compared to Rust. "The performance of hash algorithm shouldn't affect general benchmarks since _hash value is cached inside string object_." https://bugs.python.org/issue29410 – SargeATM Aug 18 '22 at 16:01
  • Interestingly Python and Rust use the same hash algorithm, SipHash 1-3. – SargeATM Aug 18 '22 at 16:07

Could it be because as n gets larger the number of iterations through the loop gets smaller?

Fewer iterations through the loop would reduce the performance gain from using Rust. I'm sure there is a small per-call cost for the transition/marshaling from Python to Rust. This would explain how the performance of pure Python and Python/Rust eventually becomes the same.

SargeATM
  • This seems the most likely to me. If you have a str of length `200` and an `n=200`, then both are going to do one iteration and any difference will vanish in the overhead. What happens when `l` (the `len(str)`) is very large? To do this analysis right, you should really be looking at speed compared to a ratio of `l:n`. – Nathaniel Ford Aug 18 '22 at 15:39
  • Not in this case. The difference between the main string of about 4 million characters and the substrings makes the reduction in the number of loops negligible. – Alberto Marin Sanguino Aug 18 '22 at 15:42
  • @Evgeniy's answer is the better answer after knowing the string sizes. – SargeATM Aug 18 '22 at 16:02