I am trying to speed up Python programs using Rust, a language in which I am a total beginner. I wrote a function that counts the occurrences of each possible string of length n
within a larger string. For instance, if the main string is "AAAAT"
and n=3
, the outcome would be a hashmap {"AAA":2,"AAT":1}
. I use pyo3 to call the Rust function from Python. The code of the Rust function is:
fn count_nmers(seq: &str, n: usize) -> PyResult<HashMap<&str,u64>> {
let mut current_pos: usize = 0;
let mut counts: HashMap<&str,u64> = HashMap::new();
while current_pos+n <= seq.len() {
//print!("{}\n", &seq[current_pos..current_pos+n]);
match counts.get(&seq[current_pos..current_pos+n]) {
Some(repeats) => counts.insert(&seq[current_pos..current_pos+n],repeats+1),
None => counts.insert(&seq[current_pos..current_pos+n],1)
};
current_pos +=1;
}
//print!("{:?}",counts)
Ok(counts)
}
When I use small values for n
(n<10
), Rust is about an order of magnitude faster than Python, but as the length of n increases, the gap tends to zero with both functions having the same speed by n=200
. (see graph)
Times to count for different n-mer lengths (Python black, rust red)
I must be doing something wrong with the strings, but I can't find the mistake.
The python code is:
def nmer_freq_table(sequence,nmer_length=6):
nmer_dict=dict()
for nmer in seq_win(sequence,window_size=nmer_length):
if str(nmer) in nmer_dict.keys():
nmer_dict[str(nmer)]=nmer_dict[str(nmer)]+1
else:
nmer_dict[str(nmer)]=1
return nmer_dict
def seq_win(seq,window_size=2):
length=len(seq)
i=0
while i+window_size <= length:
yield seq[i:i+window_size]
i+=1