Playing with Rust
iterators and closures, I've implemented an N
prime numbers generator (original playground / improved code).
It performs very poorly (about 10 times slower) compared with equivalent functional code in another language I'm more proficient with, on the same box.
Some numbers:
test calc_primes ... bench: 7,754,795 ns/iter (+/- 1,887,591)
#[macro_use]
extern crate bencher;
use bencher::Bencher;
use nth_prime::*;
fn calc_primes(b: &mut Bencher) {
b.iter(|| primes(10000));
}
benchmark_group!(benches, calc_primes);
benchmark_main!(benches);
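For completeness, the bencher harness above needs wiring in Cargo.toml roughly like this (the version and the bench-file name are assumptions on my part, not taken from the original setup):

```toml
[dev-dependencies]
bencher = "0.1"

[[bench]]
name = "calc_primes"   # assumed bench file: benches/calc_primes.rs
harness = false        # required: bencher supplies its own main via benchmark_main!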
I expected Rust to outperform it, but I'm struggling to get rid of the flaws in the code below. So here's a call for help optimising this particular implementation.
- I don't like those `Vec` allocations; some are surely not needed... could probably pass more references here and there
- the other implementation uses a logical OR on `fvec` as a bit array, which is super quick. The Rust code below expresses the same approach in the `fold` call, but uses `Vec<bool>`, which must be slow versus the bit-array equivalent...
- use a different datatype instead of `Vec<bool>`; I wouldn't mind allocating a massive byte array and operating on it with bit shifts etc., but can't imagine putting this together...
pub fn primes(n: u64) -> Vec<usize> {
if n < 4 {
vec![2]
} else {
base(primes((n as f64).sqrt().ceil() as u64), n)
}
}
fn base(r: Vec<usize>, p: u64) -> Vec<usize> {
let fvec = (0..p).map(|_| false).collect::<Vec<bool>>();
let z = r
.iter()
.map(|&x| (0..x).map(|y| y > 0).collect::<Vec<bool>>())
.map(|x| {
(0..p)
.map(|n| !x[(n % (x.len() as u64)) as usize])
.collect::<Vec<bool>>()
})
.fold(fvec, |acc, x| {
acc.iter().zip(x).map(|(a, b)| *a || b).collect()
});
let yy: Vec<usize> = z
.iter()
.enumerate()
.filter_map(|(i, x)| if !*x { Some(i) } else { None })
.skip(1)
.collect();
r.iter().chain(yy.iter()).map(|x| *x).collect()
}
Regarding prime-number generation algorithms, there's an outstanding answer in this thread, so this is not a question about optimising the algorithm, but about this particular implementation, without external crates (it's a learning exercise).
EDIT:
After some hints from the comments section, I got it significantly optimised:
test calc_primes ... bench: 4,776,400 ns/iter (+/- 1,427,534)
test calc_primes_sieve ... bench: 644,220 ns/iter (+/- 1,655,581)
test calc_primes_2 ... bench: 268,598 ns/iter (+/- 46,440)
- Replaced the 'inefficient' `Vec<bool>` `fold` bitmap with the idiomatic iterator `step_by`
- Dropped immutable types in favour of mutable state (is there a way to keep the immutable approach?)
fn base2(head: Vec<u64>, p: u64) -> Vec<u64> {
let mut fvec = (0..p).map(|_| false).collect::<Vec<bool>>();
head.iter()
.map(|&x| {
(0..p)
.step_by(x as usize)
.for_each(|y| fvec[y as usize] = true)
})
.for_each(drop);
let tail: Vec<u64> = fvec.iter().enumerate()
.filter_map(|(i, x)| if !*x { Some(i as u64) } else { None })
.skip(1)
.collect();
head.iter().chain(tail.iter()).cloned().collect()
}
EDIT 2:
For reference, here's the 'other' language implementation:
q)p:{$[x<4; enlist 2; r, 1_ where not any x#'not til each r:p ceiling sqrt x]}
q)\ts do[1000;p 10000]
474 443088
= 474,000 ns/iter
For 64-bit q v4.0:
q)\t do[1000;p 10000]
273
= 273,000 ns/iter
Not too bad for an interpreted language.
EDIT 3
Added more benchmarks for the new implementations from the answers (as they appeared, in chronological order). I was able to shave off a tad more with no `skip(1)` and a different `fvec` allocation.
New playground code.
test ssqrt_p_1__vec_bool_n_mod ... bench: 6,838,150 ns/iter (+/- 627,389)
test sieve_p_2__mut_step_by ... bench: 367,229 ns/iter (+/- 38,279)
test ssqrt_p_3__mut_step_by ... bench: 266,189 ns/iter (+/- 56,215)
test ssqrt_p_4__push_step_by_mod ... bench: 1,255,206 ns/iter (+/- 262,107)
test sieve_p_5__for_loop_mut ... bench: 441,397 ns/iter (+/- 47,077)
test ssqrt_p_6__no_skip ... bench: 176,186 ns/iter (+/- 24,765)
The standard deviation plays a big role now... running with:
taskset -c 0 cargo bench --jobs 1
fn base3(head: Vec<u64>, p: u64) -> Vec<u64> {
let mut fvec = vec![false; p as usize];
fvec[1] = true;
head.iter()
.map(|&x| {
(0..p)
.step_by(x as usize)
.for_each(|y| fvec[y as usize] = true)
})
.for_each(drop);
let tail: Vec<u64> = fvec
.iter()
.enumerate()
.filter_map(|(i, x)| if !*x { Some(i as u64) } else { None })
.collect();
head.iter().chain(tail.iter()).cloned().collect()
}
EDIT 5
Surprisingly, the no_chain version below is slower than the base3 no_skip version. I guess that's compiler magic (tail-call optimisation in the recursive functions???).
fn base4(head: Vec<u64>, p: u64) -> Vec<u64> {
let mut fvec = vec![false; p as usize];
fvec[1] = true;
head.iter()
.map(|&x| {
(0..p)
.step_by(x as usize)
.for_each(|y| fvec[y as usize] = true);
fvec[x as usize] = false;
})
.for_each(drop);
fvec.iter().enumerate()
.filter_map(|(i, x)| if !*x { Some(i as u64) } else { None })
.collect()
}