7

I am writing a thread-safe object that essentially represents a double and uses a lock to ensure safe reading and writing. I use many of these objects (20-30) in a piece of code that reads and writes them all 100 times per second, and I am measuring the average computation time of each time step. I looked at a few possible implementations of my getter, and after running many tests and collecting many samples to average out my measurements, I find that certain implementations consistently perform better than others, but not the implementations I would expect.

Implementation 1) Computation time average = 0.607ms:

protected override double GetValue()
{
    lock(_sync)
    {
        return _value;
    }
}

Implementation 2) Computation time average = 0.615ms:

protected override double GetValue()
{
    double result;
    lock(_sync)
    {
        result = _value;
    }
    return result;
}

Implementation 3) Computation time average = 0.560ms:

protected override double GetValue()
{
    double result = 0;
    lock(_sync)
    {
        result = _value;
    }
    return result;
}

What I expected: I had expected implementation 3 to be the worst of the three (it was actually my original code, so it was chance or lazy coding that I wrote it this way), but surprisingly it is consistently the best performer. I would have expected implementation 1 to be the fastest. I also expected implementation 2 to be at least as fast as, if not faster than, implementation 3, since it only removes an assignment to the local result that is overwritten anyway, making that assignment unnecessary.

My question is: can anyone explain why these 3 implementations have the relative performance that I have measured? It seems counter-intuitive to me and I would really like to know why.

I realize that these differences are not major, but their relative ordering is consistent every time I run the test, collecting thousands of samples per test to average out the computation time. Also, please keep in mind that I am doing these tests because my application requires very high performance, or at least as good as I can reasonably get. My test case is just a small one, and my code's performance will matter when running in release.
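For context, my measurement loop is roughly shaped like this (a simplified, self-contained sketch, not my actual test code; the class and names are illustrative):

```csharp
using System;
using System.Diagnostics;

// Illustrative stand-in for my thread-safe double (implementation 1 style).
class ThreadSafeDouble
{
    private readonly object _sync = new object();
    private double _value;
    public ThreadSafeDouble(double value) { _value = value; }
    public double GetValue() { lock (_sync) { return _value; } }
    public void SetValue(double value) { lock (_sync) { _value = value; } }
}

static class Benchmark
{
    static void Main()
    {
        var values = new ThreadSafeDouble[25];      // 20-30 objects in practice
        for (int i = 0; i < values.Length; i++)
            values[i] = new ThreadSafeDouble(i);

        var watch = new Stopwatch();
        double totalMs = 0;
        const int steps = 2000;  // ~20s worth of samples at my 100Hz rate
        for (int s = 0; s < steps; s++)
        {
            watch.Restart();
            foreach (var v in values)               // one "time step"
                v.SetValue(v.GetValue() + 1);
            watch.Stop();
            totalMs += watch.Elapsed.TotalMilliseconds;
        }
        Console.WriteLine("Average step: {0:F3}ms", totalMs / steps);
    }
}
```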

EDIT: note that I am using MonoTouch and running the code on an iPad Mini device, so perhaps this is nothing related to C# and more something related to MonoTouch's cross-compiler.

Camputer
    Considering how near-identical these are (even from an IL perspective I don't expect them to be computationally _that_ much different) and how close the benchmark times are, I suspect the big culprit is your testing method. Is it compiled for `release` mode? How are you benchmarking? Are other processes on the computer possibly interfering with processing resources/time? Are you simulating lock contention at all, and if so, how? EDIT: Also, I suspect your real-world application has more complicated work in the `lock` block? Because as-is, the lock seems superfluous to me. – Chris Sinclair Apr 11 '13 at 13:59
    1) Without contention for the lock the tests are meaningless. – H H Apr 11 '13 at 14:02
  • please show the code you are using to test – omer schleifer Apr 11 '13 at 14:02
    2) Take a good look at `System.Threading.Interlocked` – H H Apr 11 '13 at 14:03
  • @ChrisSinclair I agree that the benchmark times and implementations are nearly identical, and I admit that. I am just curious whether these slight differences might affect the performance of the IL code. I am implementing this as a thread-safe type of value that will be used across a simulator running in "real-time" (no, not hard real time, just trying to run things like filters, solvers, etc. at a sufficiently fast rate to simulate them). That's why I'm curious about small but consistent performance differences, since this could be simulating a model with 100s of states. – Camputer Apr 11 '13 at 14:09
  • @Camputer The changes you made should not translate into any functional difference at all. They will most likely be optimized away entirely in the event you have optimizations on. The differences are almost certainly as a result of poorly designed testing code. – Servy Apr 11 '13 at 14:11
    When compiled in x86 Release mode, the first two implementations have **identical** native code. The third implementation is also the same except for extra `fldz` and `fstp` instructions to initialize the variable to 0. – Michael Liu Apr 11 '13 at 14:12
  • @ChrisSinclair No, I am currently not compiling for release mode. I am benchmarking by running the same test case for 20 seconds (long enough to see the measured average "settle") on a mobile device. I am not simulating any lock contention, as I'm more interested in the differences in the code that aren't related to the lock, and in learning what the IL implications might be (I'm very new to C#). Also, can you please explain why you say the lock seems superfluous? Reads/writes of doubles are not atomic, so I thought this would ensure safe reading across threads, no? – Camputer Apr 11 '13 at 14:14
  • I think this may be a super-duper-micro optimization. Get it working first; almost _certainly_ your bottlenecks will _not_ be with this bit of code posted, or if it is, will be due to lock contention with the writers. Once everything is working, then identify the _real_ bottlenecks. – Chris Sinclair Apr 11 '13 at 14:14
  • @Camputer Learn something new every day! Sorry, I was not aware of that tidbit, so yes, the [lock is necessary](http://stackoverflow.com/a/3677373/1269654). You should definitely compile for release mode (that _is_ how you would be deploying it, correct?) Typically, any benchmarking you do should be with the configuration settings expected to be used in production. – Chris Sinclair Apr 11 '13 at 14:18
  • @ChrisSinclair Yes, I agree that I should compile in release mode. I will have to implement a better benchmarking system as well (instead of printing the average to the console). I am as interested in ensuring super-duper optimization (trying to run 100Hz - 1000Hz simulations on a computationally limited mobile device) as I am in learning any new optimization tips/tricks in C#. If it's not the case here that there IS some kind of tricky optimization, that's fine; I will always be interested in learning any tricks to save cycles :) – Camputer Apr 11 '13 at 14:24
  • @henk if this is read-heavy (rather than write-heavy), there are better tools than `Interlocked` - the manual box comes out the fastest in my test (below) – Marc Gravell Apr 12 '13 at 09:01

2 Answers

15

Frankly, there are other, better approaches here. The following is the output from three runs (ignoring the x1 run, which is just for JIT warm-up):

x5000000
Example1        128ms
Example2        136ms
Example3        129ms
CompareExchange 53ms
ReadUnsafe      54ms
UntypedBox      23ms
TypedBox        12ms

x5000000
Example1        129ms
Example2        129ms
Example3        129ms
CompareExchange 52ms
ReadUnsafe      53ms
UntypedBox      23ms
TypedBox        12ms

x5000000
Example1        129ms
Example2        161ms
Example3        129ms
CompareExchange 52ms
ReadUnsafe      53ms
UntypedBox      23ms
TypedBox        12ms

All of these are thread-safe implementations. As you can see, the fastest is a typed box, followed by an untyped (object) box. Next, at about the same speed, come Interlocked.CompareExchange and Interlocked.Read; note that the latter only supports long, so we need to do some bit-bashing to treat the long as a double.

Obviously, test on your target framework.

For fun, I also tested a Mutex; at the same scale, that takes about 3300ms.
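That Mutex variant isn't included in the listing below; it would look roughly like this (a sketch, not the exact code that was run):

```csharp
using System.Threading;

// Sketch of the Mutex-based variant (an approximation; not part of the
// listing below). Mutex is an OS kernel object, so each WaitOne /
// ReleaseMutex pair is far more expensive than a Monitor-based "lock",
// which explains the roughly 25x slowdown.
class MutexExperiment
{
    private readonly Mutex _mutex = new Mutex();
    private double _value = 3;
    public double GetValue()
    {
        _mutex.WaitOne();
        try { return _value; }
        finally { _mutex.ReleaseMutex(); }
    }
}
```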

using System;
using System.Diagnostics;
using System.Threading;
abstract class Experiment
{
    public abstract double GetValue();
}
class Example1 : Experiment
{
    private readonly object _sync = new object();
    private double _value = 3;
    public override double GetValue()
    {
        lock (_sync)
        {
            return _value;
        }
    }
}
class Example2 : Experiment
{
    private readonly object _sync = new object();
    private double _value = 3;
    public override double GetValue()
    {
        double result;
        lock (_sync)
        {
            result = _value;
        }
        return result;
    }
}

class Example3 : Experiment
{
    private readonly object _sync = new object();
    private double _value = 3;
    public override double GetValue()
    {
        double result = 0;
        lock (_sync)
        {
            result = _value;
        }
        return result;
    }
}

class CompareExchange : Experiment
{
    private double _value = 3;
    public override double GetValue()
    {
        return Interlocked.CompareExchange(ref _value, 0, 0);
    }
}
class ReadUnsafe : Experiment
{
    private long _value = DoubleToInt64(3);
    static unsafe long DoubleToInt64(double val)
    {   // I'm mainly including this for the field initializer
        // in real use this would be manually inlined
        return *(long*)(&val);
    }
    public override unsafe double GetValue()
    {
        long val = Interlocked.Read(ref _value);
        return *(double*)(&val);
    }
}
class UntypedBox : Experiment
{
    // references are always atomic
    private volatile object _value = 3.0;
    public override double GetValue()
    {
        return (double)_value;
    }
}
class TypedBox : Experiment
{
    private sealed class Box
    {
        public readonly double Value;
        public Box(double value) { Value = value; }

    }
    // references are always atomic
    private volatile Box _value = new Box(3);
    public override double GetValue()
    {
        return _value.Value;
    }
}
static class Program
{
    static void Main()
    {
        // once for JIT
        RunExperiments(1);
        // three times for real
        RunExperiments(5000000);
        RunExperiments(5000000);
        RunExperiments(5000000);
    }
    static void RunExperiments(int loop)
    {
        Console.WriteLine("x{0}", loop);
        RunExperiment(new Example1(), loop);
        RunExperiment(new Example2(), loop);
        RunExperiment(new Example3(), loop);
        RunExperiment(new CompareExchange(), loop);
        RunExperiment(new ReadUnsafe(), loop);
        RunExperiment(new UntypedBox(), loop);
        RunExperiment(new TypedBox(), loop);
        Console.WriteLine();
    }
    static void RunExperiment(Experiment test, int loop)
    {
        // avoid any GC interruptions
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        GC.WaitForPendingFinalizers();

        double val = 0;
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < loop; i++)
            val = test.GetValue();
        watch.Stop();
        if (val != 3.0) Console.WriteLine("FAIL!");
        Console.WriteLine("{0}\t{1}ms", test.GetType().Name,
            watch.ElapsedMilliseconds);

    }

}
Marc Gravell
  • Can you remove the unsafe modifier in the TypedBox.GetValue? – Guillaume86 Apr 12 '13 at 09:33
  • @Guillaume86 sorry, copy-paste error; yes - will edit; it doesn't change the numbers – Marc Gravell Apr 12 '13 at 09:52
  • @MarcGravell Thank you for the very interesting comparisons! I have a question, is the TypedBox implementation only allowing reads because of the readonly keyword? I do need to read and write safely, I just did not include the writing code since I was not testing that in my original post, although I used the same lock mechanism in my SetValue method. If you add a method SetValue in the typed and untyped Box examples, would these implementations still be thread safe? I don't see what makes the read in these classes exclusive/atomic, am I missing something? (new to c#!) – Camputer Apr 12 '13 at 13:44
  • @Camputer to write you assign a new Box(value) - although thinking about it you might need to add the volatile keyword to the two fields in the two box examples. The language specification guarantees that access to a reference is always atomic, and the object being immutable guarantees that once we've dereferenced the value is fine. – Marc Gravell Apr 12 '13 at 14:33
6

Measuring only reads for concurrency is misleading; your cache will give you orders of magnitude better results than a real use case would. So I added SetValue to Marc's example:

using System;
using System.Diagnostics;
using System.Threading;

abstract class Experiment
{
    public abstract double GetValue();
    public abstract void SetValue(double value);
}

class Example1 : Experiment
{
    private readonly object _sync = new object();
    private double _value = 3;
    public override double GetValue()
    {
        lock (_sync)
        {
            return _value;
        }
    }

    public override void SetValue(double value)
    {
        lock (_sync)
        {
            _value = value;
        }

    }

}
class Example2 : Experiment
{
    private readonly object _sync = new object();
    private double _value = 3;
    public override double GetValue()
    {
        double result;
        lock (_sync)
        {
            result = _value;
        }
        return result;
    }

    public override void SetValue(double value)
    {
        lock (_sync)
        {
            _value = value;
        }
    }

}



class Example3 : Experiment
{
    private readonly object _sync = new object();
    private double _value = 3;
    public override double GetValue()
    {
        double result = 0;
        lock (_sync)
        {
            result = _value;
        }
        return result;
    }

    public override void SetValue(double value)
    {
        lock (_sync)
        {
            _value = value;
        }
    }
}

class CompareExchange : Experiment
{
    private double _value = 3;
    public override double GetValue()
    {
        return Interlocked.CompareExchange(ref _value, 0, 0);
    }

    public override void SetValue(double value)
    {
        Interlocked.Exchange(ref _value, value);
    }
}
class ReadUnsafe : Experiment
{
    private long _value = DoubleToInt64(3);
    static unsafe long DoubleToInt64(double val)
    {   // I'm mainly including this for the field initializer
        // in real use this would be manually inlined
        return *(long*)(&val);
    }
    public override unsafe double GetValue()
    {
        long val = Interlocked.Read(ref _value);
        return *(double*)(&val);
    }

    public override void SetValue(double value)
    {
        long intValue = DoubleToInt64(value);
        Interlocked.Exchange(ref _value, intValue);
    }
}
class UntypedBox : Experiment
{
    // references are always atomic
    private volatile object _value = 3.0;
    public override double GetValue()
    {
        return (double)_value;
    }

    public override void SetValue(double value)
    {
        object valueObject = value;
        _value = valueObject;
    }
}
class TypedBox : Experiment
{
    private sealed class Box
    {
        public readonly double Value;
        public Box(double value) { Value = value; }

    }
    // references are always atomic
    private volatile Box _value = new Box(3);
    public override double GetValue()
    {
        Box value = _value;
        return value.Value;
    }

    public override void SetValue(double value)
    {
        Box boxValue = new Box(value);
        _value = boxValue;
    }
}
static class Program
{
    static void Main()
    {
        // once for JIT
        RunExperiments(1);
        // three times for real
        RunExperiments(5000000);
        RunExperiments(5000000);
        RunExperiments(5000000);
    }
    static void RunExperiments(int loop)
    {
        Console.WriteLine("x{0}", loop);
        RunExperiment(new Example1(), loop);
        RunExperiment(new Example2(), loop);
        RunExperiment(new Example3(), loop);
        RunExperiment(new CompareExchange(), loop);
        RunExperiment(new ReadUnsafe(), loop);
        RunExperiment(new UntypedBox(), loop);
        RunExperiment(new TypedBox(), loop);
        Console.WriteLine();
    }
    static void RunExperiment(Experiment test, int loop)
    {
        // avoid any GC interruptions
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        GC.WaitForPendingFinalizers();

        int threads = Environment.ProcessorCount;

        ManualResetEvent done = new ManualResetEvent(false);

        // Since we use threads, divide the original workload
        //
        int workerLoop = Math.Max(1, loop / Environment.ProcessorCount);
        int writeRatio = 1000;
        int writes = Math.Max(workerLoop / writeRatio, 1);
        int reads = workerLoop / writes;

        var watch = Stopwatch.StartNew();

        for (int t = 0; t < Environment.ProcessorCount; ++t)
        {
            ThreadPool.QueueUserWorkItem((state) =>
                {
                    try
                    {
                        double val = 0;

                        // Two loops to avoid comparison for % in the inner loop
                        //
                        for (int j = 0; j < writes; ++j)
                        {
                            test.SetValue(j);
                            for (int i = 0; i < reads; i++)
                            {
                                val = test.GetValue();
                            }
                        }
                    }
                    finally
                    {
                        if (0 == Interlocked.Decrement(ref threads))
                        {
                            done.Set();
                        }
                    }
                });
        }
        done.WaitOne();
        watch.Stop();
        Console.WriteLine("{0}\t{1}ms", test.GetType().Name,
            watch.ElapsedMilliseconds);

    }
}

Results are, at 1000:1 read:write ratio:

x5000000
Example1        353ms
Example2        395ms
Example3        369ms
CompareExchange 150ms
ReadUnsafe      161ms
UntypedBox      11ms
TypedBox        9ms

100:1 (read:write)

x5000000
Example1        356ms
Example2        360ms
Example3        356ms
CompareExchange 161ms
ReadUnsafe      172ms
UntypedBox      14ms
TypedBox        13ms

10:1 (read:write)

x5000000
Example1        383ms
Example2        394ms
Example3        414ms
CompareExchange 169ms
ReadUnsafe      176ms
UntypedBox      41ms
TypedBox        43ms

2:1 (read:write)

x5000000
Example1        550ms
Example2        581ms
Example3        560ms
CompareExchange 257ms
ReadUnsafe      292ms
UntypedBox      101ms
TypedBox        122ms

1:1 (read:write)

x5000000
Example1        718ms
Example2        745ms
Example3        730ms
CompareExchange 381ms
ReadUnsafe      376ms
UntypedBox      161ms
TypedBox        200ms

*Updated the code to remove the unnecessary ICX (interlocked) operations on write, since the value is always overwritten. Also fixed the formula that computes the number of reads so that it divides by the thread count (same total work).

Remus Rusanu
  • Thank you for adding the SetValue in there as well, since I think it makes a difference. Question: is it not expensive to be creating new instances of Box in the TypedBox implementation? Is that why its execution time starts to creep up as you increase the proportion of writes in your results? – Camputer Apr 12 '13 at 13:50
  • To answer your question one would have to actually [profile](http://msdn.microsoft.com/en-us/library/z9z62c29.aspx) the code. – Remus Rusanu Apr 12 '13 at 14:28
  • Very thorough follow-up, thanks. For info, I meant to add "volatile" to the box/typed-box examples - otherwise you could get dirty reads - my mistake. – Marc Gravell Apr 12 '13 at 14:37
  • @RemusRusanu I mean in general: if I am performing a large number of reads and writes while running time-sensitive computations, my instinct tells me to avoid instantiating new objects if possible. Does that make sense? – Camputer Apr 12 '13 at 14:39
  • @Camputer: it sure does. I am a bit puzzled by the results myself, as the box examples seem way too good to me. But keep in mind that this example only shows basic safe atomic access. The threads are overwriting each other's values and it is a last-write-wins situation, which is unlikely to be the real use case. – Remus Rusanu Apr 12 '13 at 14:43
  • @MarcGravell: I must say that I'm puzzled to extremes why the Box/object case is so fast... my money was all on ICX. I wonder if there are layers of IL/JIT that hide what's happening at the bare-metal level, but I don't have time now to dig much deeper. – Remus Rusanu Apr 12 '13 at 14:46
  • @Camputer: fyi the code has a slight 'statistics' problem, in that the writes are not counted as part of the 'work'. At a 1:1 ratio that means that for 5000000 reads there are also 5000000 *writes* against the same wall time. Although fixing that changes the measured time insignificantly in the best case, hinting that most of the time is indeed cache invalidation/memory access, not CPU work. – Remus Rusanu Apr 12 '13 at 14:50
  • If I had to guess, the simple dereference allows a nice short IL path ideal for inlining, and a nice short instruction pipe that will be ideal for the cpu since it doesn't have any branches. – Marc Gravell Apr 12 '13 at 17:39