
Rayon looks great for parallelizing algorithms over collections, and Faster is great for vectorization (SIMD) on the x86 platform for collections like Vec<f32>. I've tried to combine them, but their iterators don't seem to compose. Is there a way to make use of these two libraries together for algorithms that would benefit from both vectorization and parallelization? Like this one from the Faster examples:

let lots_of_3s = (&[-123.456f32; 128][..]).iter()
    .map(|v| {
        9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0
    })
    .collect::<Vec<f32>>();
Shepmaster
Stan Prokop
    One thing to be careful of are hyperthreaded cores. My understanding is that these virtual cores don't have their own SIMD units and thus you'd be competing with yourself, causing a large amount of cache thrashing. – Shepmaster Jul 09 '18 at 20:33
    @Shepmaster: yup, code that can really saturate the SIMD execution units of a whole core with just one thread doesn't need hyperthreading, but that's rare. Like maybe a well-tuned matmul. Even video codecs like x264 and x265 (which heavily use vector ALUs) get a minor speedup from using both logical cores of each physical core, like maybe 15% for `x264 -preset slower` on Skylake i7-6700k on a 1080p video. HT can be a good way to expose more instruction-level parallelism if your code bottlenecks on latency instead of throughput (e.g. a reduction without enough accumulators). – Peter Cordes Jul 09 '18 at 22:09
    @Shepmaster: TL:DR: SMT (generic form of Hyperthreading) is a way to expose more instruction-level parallelism to each physical core. Loops with independent iterations don't tend to benefit as much, but hiding cache-miss latency can still help unless the extra cache pressure from 2 threads competing for the same cache causes too many more misses. (Some microarchitectural resources are statically partitioned, e.g. the ROB and store buffer in Intel CPUs, but it's less likely that performance drops off a cliff from shrinking them.) – Peter Cordes Jul 09 '18 at 22:13
  • Hm, I didn't realize there is a catch, that's really good to know. Thank you for the comments. – Stan Prokop Jul 10 '18 at 09:22
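The "reduction without enough accumulators" point from the comments can be illustrated with a standard-library-only sketch (no SIMD or threads; function names here are illustrative, not from either crate). A single accumulator forms one long dependency chain bounded by floating-point add latency, while several independent accumulators expose instruction-level parallelism that the CPU (or an SMT sibling thread) can fill:

```rust
// Single accumulator: each addition depends on the previous result,
// so the loop is limited by floating-point add latency.
fn sum_single(data: &[f32]) -> f32 {
    data.iter().fold(0.0, |acc, &x| acc + x)
}

// Four independent accumulators: the adds to acc[0]..acc[3] have no
// dependency on each other, so they can overlap in the pipeline --
// the same latency-hiding that SMT provides across threads.
fn sum_four(data: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let mut chunks = data.chunks_exact(4);
    for c in &mut chunks {
        for i in 0..4 {
            acc[i] += c[i];
        }
    }
    let tail: f32 = chunks.remainder().iter().sum();
    acc[0] + acc[1] + acc[2] + acc[3] + tail
}

fn main() {
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    // Both variants agree (up to floating-point reassociation).
    println!("{} {}", sum_single(&data), sum_four(&data));
}
```

Note the two versions can differ slightly for general inputs, since reassociating floating-point addition changes rounding.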

1 Answer


You can use Rayon's par_chunks to split the data and process each chunk with Faster:

use faster::*;
use rayon::prelude::*;

let lots_of_3s = (&[-123.456f32; 1000000][..])
    .par_chunks(128)
    .flat_map(|chunk| {
        chunk
            .simd_iter(f32s(0.0))
            .simd_map(|v| {
                // rsqrt() replaces the .sqrt().recip() of the scalar version
                f32s(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() - f32s(4.0) - f32s(2.0)
            })
            .scalar_collect()
    })
    .collect::<Vec<f32>>();
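The same chunk-then-process structure can be sketched with only the standard library, using scoped threads in place of Rayon's work-stealing pool and a scalar kernel in place of Faster's SIMD iterator. This is an illustrative sketch of the pattern, not a substitute for the crates; `process_chunk` is a hypothetical name:

```rust
use std::thread;

// Apply the example kernel to each element of a chunk sequentially;
// in the answer above, this inner loop is what Faster vectorizes.
fn process_chunk(chunk: &[f32]) -> Vec<f32> {
    chunk
        .iter()
        .map(|v| 9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0)
        .collect()
}

fn main() {
    let input = [-123.456f32; 1024];
    // Split into fixed-size chunks and hand each to a scoped thread,
    // mirroring what par_chunks does with a thread pool.
    let results: Vec<Vec<f32>> = thread::scope(|s| {
        let handles: Vec<_> = input
            .chunks(128)
            .map(|chunk| s.spawn(move || process_chunk(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    let lots_of_3s: Vec<f32> = results.into_iter().flatten().collect();
    assert_eq!(lots_of_3s.len(), 1024);
    assert!(lots_of_3s.iter().all(|&x| (x - 3.0).abs() < 1e-6));
}
```

In practice one thread per chunk is wasteful; Rayon amortizes this by scheduling chunks onto a fixed pool, which is why par_chunks is the right tool here.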
Anders Kaseorg