Why is lock 240% faster than ReaderWriterLockSlim?

Question

I have read another SO question: When is ReaderWriterLockSlim better than a simple lock?

It does not explain exactly why ReaderWriterLockSlim is so slow compared to lock.

Yes, my test runs with zero contention, but that still doesn't explain the staggering difference.

The read lock takes 2.7 s, the write lock 2.2 s, and lock 1.0 s.

This is the complete code:

using System;
using System.Diagnostics;
using System.Threading;

namespace test
{
    internal class Program
    {
        static int[] data = new int[100000000];
        static object lock1 = new object();
        static ReaderWriterLockSlim lock2 = new ReaderWriterLockSlim();

        static void Main(string[] args)
        {
            for (int z = 0; z < 3; z++)
            {
                var sw = Stopwatch.StartNew();

                for (int i = 0; i < data.Length; i++)
                {
                    lock (lock1)
                    {
                        data[i] = i;
                    }
                }

                sw.Stop();

                Console.WriteLine("Lock: {0}", sw.Elapsed);

                sw.Restart();

                for (int i = 0; i < data.Length; i++)
                {
                    try
                    {
                        lock2.EnterReadLock();
                        data[i] = i;
                    }
                    finally
                    {
                        lock2.ExitReadLock();
                    }
                }

                sw.Stop();

                Console.WriteLine("Read: {0}", sw.Elapsed);

                sw.Restart();

                for (int i = 0; i < data.Length; i++)
                {
                    try
                    {
                        lock2.EnterWriteLock();
                        data[i] = i;
                    }
                    finally
                    {
                        lock2.ExitWriteLock();
                    }
                }

                sw.Stop();

                Console.WriteLine("Write: {0}\n", sw.Elapsed);

            }

            Console.ReadKey(false);
        }
    }
}

Given that `lock` is so hyper-optimized for the no-contention case, I'm frankly surprised that an uncontested `ReaderWriterLockSlim` is only twice as expensive — canton7, Jun 01 '22 at 13:00
Both are slower than non-blocking code and using the correct types. If you want to coordinate producers and consumers use Channel — Panagiotis Kanavos, Jun 01 '22 at 13:01
I'm with Canton on this one -- what's staggering here is not that it's slower but that it's pretty damn fast! Benchmarking synchronization primitives is a dangerous thing to do, because it may mislead you into putting performance before correctness. If you're not yourself in the business of writing them for libraries, you should probably steer away from any kind of test like this until you have real code with real contention and real behavior to profile and optimize, and *then* look into it -- carefully. — Jeroen Mostert, Jun 01 '22 at 13:20
Stopwatch can give you a rough estimate, but it is not a dedicated benchmarking tool. I would guess you'd get numbers closer to what canton7 expects if you used BenchmarkDotNet. — Fildor, Jun 01 '22 at 13:20
Remember that `ReaderWriterLockSlim` has to do a lot more bookkeeping than a simple `Monitor`. On the other hand, you'd only use a read/write lock if you're expecting contention: if you're not expecting any, then a simple `lock` will do. So benchmarking the no-contention case is pretty pointless. — canton7, Jun 01 '22 at 13:22
I don't think that you are using the `ReaderWriterLockSlim` correctly. AFAIK the `EnterWriteLock`/`EnterReadLock` should be placed *before* entering the `try` block. Could you redo the benchmark with the correct usage? — Theodor Zoulias, Jun 01 '22 at 13:26
@TheodorZoulias You mean like in the examples [here](https://learn.microsoft.com/en-us/dotnet/api/system.threading.readerwriterlockslim?view=net-6.0#examples)? — Fildor, Jun 01 '22 at 13:33
Yes, it should go outside the `try/finally` because you don't want to call `ExitReadLock()` if the `EnterReadLock()` failed (for example, by throwing `LockRecursionException`) — Matthew Watson, Jun 01 '22 at 13:48
Are you interested in a low-level/technical explanation of why acquiring an uncontested `lock` is faster than acquiring an uncontested `ReaderWriterLockSlim`, including precise measurements of the individual IL instructions emitted by the two operations, or are you just looking for a high-level/logical explanation of why these two primitives perform like this under the conditions simulated by your benchmark? — Theodor Zoulias, Jun 01 '22 at 14:11
@TheodorZoulias - the latter (timing IL instructions is not a trivial task, so I don't want to know). @Panagiotis/@Jeroen/@Fildor/@Matthew - thank you for your input, duly noted. @canton - I always thought `monitor` is pretty expensive because it goes into kernel mode, which eats at least 40us there and back. Do you have proof that the lock "hyper-optimisation" does not in fact enter `monitor` in the no-contention case? — Boppity Bop, Jun 01 '22 at 20:16
`I always thought monitor is pretty expensive because it goes into kernel mode` That is not correct; it's possible you're confusing it with Mutex. An uncontended Monitor is pretty much just an atomic CAS operation on the object header. It's _very_ fast. I give a brief explanation of how it works at the 16 minute mark in this talk if you're interested: https://youtu.be/k_tavcIrrss?t=960 — Kevin Gosse, Jun 02 '22 at 08:28
@KevinGosse - if you write it as an answer, I'll mark it. Thank you. — Boppity Bop, Jun 02 '22 at 16:12
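
For reference, the usage that Theodor Zoulias and Matthew Watson describe in the comments above (enter the lock before the `try` block, exit it in `finally`) could look roughly like the sketch below. The names `RwLockUsage`, `Gate`, `Read`, and `Write` are made up for illustration and do not come from the question.

```csharp
using System;
using System.Threading;

// Hypothetical helper, for illustration only.
internal static class RwLockUsage
{
    private static readonly ReaderWriterLockSlim Gate = new ReaderWriterLockSlim();
    private static int _value;

    public static int Read()
    {
        // Enter BEFORE the try block: if EnterReadLock throws (for example
        // LockRecursionException), the finally block never runs, so
        // ExitReadLock is not called on a lock that was never taken.
        Gate.EnterReadLock();
        try
        {
            return _value;
        }
        finally
        {
            Gate.ExitReadLock();
        }
    }

    public static void Write(int newValue)
    {
        // Same shape for the write lock.
        Gate.EnterWriteLock();
        try
        {
            _value = newValue;
        }
        finally
        {
            Gate.ExitWriteLock();
        }
    }
}
```

Moving the `Enter` call out of the `try` block is a correctness fix. It is not expected to change the uncontended timings in any meaningful way.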

score 3 · Answer 1 · answered Jun 02 '22 at 06:02

You are looking at two devices. On the left is a lock. On the right is a ReaderWriterLockSlim.

The device on the left is used to control a single electric lamp from a single location. The device on the right is used to control two lamps from two different locations.¹ The device on the left is cheaper to buy, requires less wiring, is simpler to install and operate, and loses less energy to heat than the device on the right.

The analogy with the SPST/DPDT electric switches is probably far from perfect, but my point is that a lock is a comparatively simpler mechanism than the ReaderWriterLockSlim. It is used to enforce a single policy on a homogeneous group of worker threads. A ReaderWriterLockSlim, on the other hand, is used to enforce two different policies on two separate groups of workers (readers and writers), governing how they interact with members of the same group and of the other group. It should be no big surprise that the more complex mechanism has a higher operational cost (overhead) than the simpler one. That's the price you have to pay to get finer control over the worker threads.

¹ Or maybe not. I am not an electrician!

I love this way to explain. BTW, it's not clear what actually the right device does, but it seems a DPDT. What you probably mean, is to command a lamp from two or more different locations. However, if you choose to wire many lamps instead of a single one, it doesn't change anything to the circuit behavior. — Mario Vernari, Jun 02 '22 at 06:16
I think the switch on the right is often used to select two different speeds for a fan, e.g. in an [extractor hood](https://en.wikipedia.org/wiki/Kitchen_hood). [Example](https://www.gram.dk/produkter/emhaetter/efu-604-90-x). [Picture](https://www.gram.dk/Files/Billeder/Ecom/Produkter/EFU%20604-90%20X.jpg). — Peter Mortensen, Aug 15 '22 at 10:25
@PeterMortensen maybe you are right. But I am not sure why it has two contacts on each side. I find this piece of hardware quite puzzling! — Theodor Zoulias, Aug 15 '22 at 10:32

score 1 · Accepted Answer · edited Aug 15 '22 at 10:15

Thanks to canton7 and Kevin Gosse, I found my 2013 question perfectly answered by Hans Passant: When exactly does .NET Monitor go to kernel-mode?

So lock is faster in a no-contention scenario simply because it has lighter logic and kernel mode is not involved.
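
To make "kernel mode is not involved" more concrete: Kevin Gosse's comment on the question describes an uncontended Monitor as essentially one atomic compare-and-swap. The toy lock below sketches what such a user-mode fast path can look like. It is a hypothetical illustration, not the CLR's actual Monitor implementation; `CasLockSketch`, `TryEnter`, and `Exit` are invented names, and the sketch has no waiting, recursion, or ownership tracking.

```csharp
using System;
using System.Threading;

// Hypothetical toy lock, for illustration only. NOT how Monitor is implemented.
internal sealed class CasLockSketch
{
    // 0 = free, 1 = held.
    private int _held;

    // Uncontended fast path: one atomic compare-and-swap in user mode.
    // Returns true if this call took the lock, false if it was already held.
    public bool TryEnter()
    {
        return Interlocked.CompareExchange(ref _held, 1, 0) == 0;
    }

    // Releases the lock by publishing 0 so other threads can see it is free.
    public void Exit()
    {
        Volatile.Write(ref _held, 0);
    }
}
```

A reader/writer lock has to track more state than this single flag (reader counts, a writer, waiters), which is consistent with the extra bookkeeping mentioned in the comments on the question.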

edited Aug 15 '22 at 10:15 by Peter Mortensen
answered Jun 02 '22 at 16:11 by Boppity Bop

`ReaderWriterLockSlim` also avoids calling into the kernel where possible. That's what the "Slim" bit means -- the "Slim" locks are implemented using managed code in user-land as far as possible, and only call into the kernel as a last resort. You can read the source [here](https://source.dot.net/#System.Private.CoreLib/ReaderWriterLockSlim.cs,8c1a3a50bf9c4faf) – canton7 Jun 04 '22 at 09:04
Maybe, but a 240% difference is huge – Boppity Bop Jun 04 '22 at 12:00
Not really. It's just double-and-a-bit. Twice a small number is still a small number. When `Monitor` is doing a single compare/exchange, `ReaderWriterLockSlim` might be doing two compare/exchanges, with a little bit of extra logic. That's still cheap. – canton7 Jun 05 '22 at 10:25
To give a sense of scale, a transition into the kernel costs on the order of microseconds. Context switches also cost on the order of microseconds. A compare/exchange is much cheaper -- so much so that the latency is heavily influenced by what's in cache, but on a good day they're down at nanoseconds. All of your lock operations took on the order of nanoseconds. Had there been a call into the kernel, I'd have expected that to be around 1000 times more expensive, so your test would have taken 2500 seconds (or ~42 *minutes*) rather than 2.5 seconds – canton7 Jun 05 '22 at 10:32
"Twice a small number is still a small number" - it's not small if you get 1000s of hits per second. – Boppity Bop Jun 06 '22 at 12:16
It is still a small number. If you're getting 1000s of hits per second, you're going to have things which are significantly more expensive than an uncontested lock, whether that's a `Monitor` or a `ReaderWriterLockSlim`. When you're dealing with stuff that's orders of magnitude more expensive, double *is* small – canton7 Jun 06 '22 at 12:36
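
The sense of scale in these comments can be checked with a little arithmetic on the figures from the question. The helper below is only a sketch; `LockCostEstimate` and `NanosecondsPerOperation` are invented names, and the iteration count is the array length from the benchmark.

```csharp
using System;

// Hypothetical helper, for illustration only.
internal static class LockCostEstimate
{
    // Number of lock/unlock pairs per timed loop (data.Length in the question).
    public const long Iterations = 100000000L;

    // Converts a total loop time in seconds into an average cost per
    // iteration in nanoseconds (lock overhead plus the array write).
    public static double NanosecondsPerOperation(double totalSeconds)
    {
        return totalSeconds * 1e9 / Iterations;
    }
}
```

With the reported timings, `NanosecondsPerOperation(1.0)` gives 10 ns per iteration for `lock`, and the 2.7 s and 2.2 s loops work out to about 27 ns and 22 ns for the read and write locks. All three are in nanosecond territory, far from the microsecond cost of a kernel transition.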
