Fastest x86 assembly code to synchronize access to an array?

Question

What is the fastest x86 assembly code to synchronize access to an array in memory?

To be more precise: We have a malloc'ed continuous single-paged region in memory and the OS will not page-out this region for the duration of our experiment. One thread will write to the array, one thread will read from the array. the array is small, but larger than the atomic-write capability of your cpu (so that a separate lock is acutally required)

"fastest": the effective speed: Do not just assume the length of bytecode is significant but take into account the caching behavior of the lock and branching behavior regarding surrounding code.

It has to work on x86-32 and/or x86-64

It has to work on-top of (or descendents of) Windows since XP, Linux since kernel 2.2, or MaxOs X (in user-mode).

Please no "it depends"-responses: If it depends on anything I have not specified here just make up your own example(s) and state what is fastest in that/those case(s).

Post code! (This is to prevent vague descriptions)

Post not only your 2-line LOCK + CMPXCHG compare&swap but show us how you integrate it with the read instructions in the one thread and the write-instructions in the other.

If you like, explain your tweaks for cache-optimality and how to avoid branch-mispredictions if the branch-target is dependant on (1) whether you get the lock or not (2) what the first byte of a larger-read is.

If you like distinguish between multiprocessing and task-switching: how will your code perform if the threads are not performed on 2 cpus but just get hold of one?

@Ken White: haha funny. or are you serious? if so: take a look at the terminology i use, and the questions i have answered. — Bernd Elkemann, Mar 12 '11 at 16:54
@Ken - I would be very interested in the school that assigns this type of question as homework. — linuxuser27, Mar 12 '11 at 16:55
@eznme, I read the terminology used. It looks like something directly out of a textbook. "Post not only your 2-line... but show". No offense intended - it didn't look like something a typical question would contain. @linuxuser27, have you looked at any of the advanced courses at MIT or RIT? — Ken White, Mar 12 '11 at 17:00
@Ken - I went to RIT (class of 2007). As for MIT, no. But I will check out their courseware site. — linuxuser27, Mar 12 '11 at 17:07
@eznme, you should. :) As I said, no offense was intended. Actually, it's pretty impressive. :) I'll leave my original comment, as I don't hide when I've been stupid in public. +1 for the question, too, with my apologies. — Ken White, Mar 12 '11 at 17:08
@Ken White: No need for apologies, its quite understandable and funny. Thank you very much for your +1, I hope this makes some experts show-off their skills. — Bernd Elkemann, Mar 12 '11 at 17:11
@eznme: You do realize that there are *huge* differences in the caching, instruction execution and, therefore, speed among the various x86 processors, right? — thkala, Mar 12 '11 at 17:11
@thkala: Absolutely, that's why I included my "it depends"-suggestion. Are you an expert in the field? If so please post your solution; i would hate to post my code as the first answer. — Bernd Elkemann, Mar 12 '11 at 17:13
This is verging on "not a real question" territory. If it was a real question you would be asking us to make up the requirements. If it was a real question it would be one question rather than 5 or 6. — David Heffernan, Mar 12 '11 at 17:31
@David Heffernan: If you see "5 or 6" questions here just answer one, its specifically the "it depends"-cases i would like to hear about. — Bernd Elkemann, Mar 12 '11 at 18:13
So you already have a solution that you think solves the problem, but rather than post it for review and possible improvement, you're challenging us to come up with our own solutions? If nothing else, this question should have the "code golf" tag. But I'm thinking this isn't a real question. — Jim Mischel, Mar 12 '11 at 22:15
The question is "What is the **fastest** x86 assembly code to synchronize access to an array in memory?" If I thought that my code would come any close, i certainly would not go through all this trouble. — Bernd Elkemann, Mar 12 '11 at 22:24
You'll get a better response if you post your code and ask for reviews. — Jim Mischel, Mar 12 '11 at 22:36
You should read this excellent article with code included : http://www.codeproject.com/KB/threads/CritSectEx.aspx . It uses intrisics to perform efficient locking, but I am sure you are able to translate that to x86/x64 asm if you are asking such a question (sorry, windows only) — J.N., Mar 13 '11 at 04:57

ninjalj · Answer 1 · 2011-03-13T22:18:49.760

2

Really, the answer is "it depends". What's the usage pattern of your array? Is it read-mostly? Is it update-mostly and you can get away with imprecise results on reading (using per-cpu arrays)? Updates are so infrequent that RCU would give serious performance improvements?

There are lots of tradeoffs here, see Paul McKenney's book: Is Parallel Programming Hard, And, If So, What Can You Do About It?

edited Mar 13 '11 at 22:18

answered Mar 13 '11 at 19:33

ninjalj

42,493
9
106
148

Olof Forshell · Answer 2 · 2011-03-13T17:58:18.153

I don't get it. Bus-locking (lock prefix or xchg mem,reg instruction) and speed have little to do with each other. It's about physically synchronizing the CPU with the slowest active device in your system - which might be connected via the 33 MHz PCI or some such - and you can bet it will be much slower than a RAM access that was not in the cache. So expect 300-3000 CPU clock cycles depending on how long you need to wait for the device. If no devices are active you'll still need to wait for the respective buses to acknowledge the lock.

Fastest code? Forget it. You need to either accept that this is how bus locks work or find other ways to synchronize that do not require bus-locking.

score -1 · Answer 3 · answered Mar 13 '11 at 18:52

-1

If locking performance is important, you're doing something wrong.

answered Mar 13 '11 at 18:52

zvrba

24,186
3
55
65

Yeah, I mostly agree. If the work chunks are big enough it amortizes the locking cost. – Zan Lynx Mar 13 '11 at 23:22

Fastest x86 assembly code to synchronize access to an array?

3 Answers3

Linked