
I have 10 threads and a Vec of length 100.

Can I have thread 0 work on elements 0-9 (sort them, for example), while thread 1 is working on elements 10-19, etc.?

Or do I have to use a Vec<Vec<>> for this? (Which I would rather avoid, because the elements would no longer be contiguous in memory)

Shepmaster
MWB
  • A search on the Internet for "rust vec thread" leads to http://stackoverflow.com/q/28599334/155423; http://stackoverflow.com/q/31644152/155423; http://stackoverflow.com/q/33818141/155423; and many others. Please make sure to do your own [effort and show that effort when asking a question](http://meta.stackoverflow.com/q/261592/155423). – Shepmaster Oct 08 '16 at 16:21
  • Maybe this approach is useful for you: https://stackoverflow.com/a/70851207/286335 – cibercitizen1 Jan 25 '22 at 15:25

1 Answer

Yes, you can. You asked about the mutable case, but I'll preface by saying that if the Vec is read-only (e.g. for a reduction), you can safely send an immutable reference to the specific slice you want into each thread, simply by taking something like &my_vec[idx1..idx2] in a loop.

The mutable case is trickier, since the borrow checker is not sophisticated enough to allow non-overlapping mutable borrows of a Vec. However, there are several methods, notably split_at_mut, that you can call to get these subslices. By far the easiest is the chunks_mut iterator. (Note that there is a matching chunks iterator for the immutable case, so you only need minor changes when writing either one.)

Be aware that the chunks and chunks_mut functions take the size of each chunk, not the number of chunks. However, deriving one from the other is fairly straightforward.
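For example, a ceiling division turns a desired chunk count into the chunk size that chunks expects (the helper name here is mine, for illustration):

```rust
fn chunk_size(len: usize, n_chunks: usize) -> usize {
    // Round up so that n_chunks chunks cover the whole slice;
    // the last chunk may be shorter than the others.
    (len + n_chunks - 1) / n_chunks
}

fn main() {
    let v = vec![0u8; 100];

    // Evenly divisible: 100 elements over 10 chunks -> chunks of 10.
    let size = chunk_size(v.len(), 10);
    assert_eq!(size, 10);
    assert_eq!(v.chunks(size).count(), 10);

    // Not divisible: 100 elements over 7 chunks -> six chunks of 15,
    // then one final chunk of 10.
    let size = chunk_size(v.len(), 7);
    assert_eq!(size, 15);
    assert_eq!(v.chunks(size).count(), 7);
}
```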

I would like to give a few words of caution about the mutable case, however. If you split the data evenly, you may get abysmal performance. The reason is that the CPU doesn't work on individual addresses; instead, it works on blocks of memory known as cache lines, which are 64 bytes long. If multiple threads write to the same cache line, the hardware must keep their caches coherent, forcing round trips through slower shared memory.

Unfortunately, in safe Rust there's no easy way to determine where on a cache line a Vec's buffer starts (the allocation may begin in the middle of a line); most of the methods I know of to detect this involve twiddling the lower bits of the actual pointer address. The easiest way to handle this is to simply add a 64-byte pad of nonsense data between each chunk you want to use. So, for instance, if you have a Vec containing 1000 32-bit floats and 10 threads, you add 16 floats with a dummy value (since 32 bits = 4 bytes, 16 * 4 = 64 bytes = one cache line) between each group of 100 of your "real" floats, and ignore the dummies during computation.

This is known as false sharing, and I encourage you to look up other references to learn other methods of dealing with this.

Note that the 64-byte line size holds on current x86 processors. If you're compiling for ARM, PowerPC, MIPS, or something else, the cache line size can and will vary.

Shepmaster
Linear