
I am having severe performance issues when running compute-intensive multiprocessed code on Mono. The simple snippet below, which estimates the value of pi using Monte Carlo methods, demonstrates the issue.

The program spawns a number of threads equal to the number of logical cores on the current machine, and performs an identical computation on each. When run on an Intel Core i7 laptop with Windows 7 using the .NET Framework 4.5, the entire process runs in 4.2 s, and the relative standard deviation among the threads' respective execution times is 2%.

However, when run on the same machine (and operating system) using Mono 2.10.9, the overall execution time shoots up to 18 s. There is a huge variance among the respective threads’ performances, with the fastest completing in just 5.6 s whilst the slowest takes 18 s. The average is 14 s, and the relative standard deviation is 28%.
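For reference, the relative standard deviation figures can be computed from the per-thread timings roughly as follows. This is a minimal sketch: the `TimingStats` class and `RelativeStdDevPercent` method are hypothetical names, and it assumes the population form of the standard deviation (dividing by N), which appears consistent with the quoted figures of about 2% and 28%.

```csharp
using System;
using System.Linq;

static class TimingStats
{
    // Relative standard deviation (coefficient of variation) in percent.
    // Assumption: population standard deviation, i.e. variance divided by N.
    public static double RelativeStdDevPercent(TimeSpan[] times)
    {
        double[] ms = times.Select(t => t.TotalMilliseconds).ToArray();
        double mean = ms.Average();
        double variance = ms.Select(v => (v - mean) * (v - mean)).Average();
        return 100.0 * Math.Sqrt(variance) / mean;
    }
}
```

It could be called after the threads have joined, for example as `TimingStats.RelativeStdDevPercent(timesTaken)`.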

The cause does not appear to be thread scheduling. Pinning each thread to a distinct core (by calling BeginThreadAffinity and SetThreadAffinityMask) does not have any significant effect on the threads’ durations or variances.
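The pinning described above can be sketched as follows. This is an illustration, not the exact code used for the measurements: the `ThreadPinning` class and `PinCurrentThreadToCore` method are hypothetical names, the P/Invoke declarations target the Win32 `kernel32.dll` functions and so work on Windows only, and the mask assumes fewer than 32 logical cores in a 32-bit process (64 in a 64-bit one).

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using System.Threading;

static class ThreadPinning
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);

    // Call at the start of the thread body. coreIndex is the zero-based logical core.
    public static void PinCurrentThreadToCore(int coreIndex)
    {
        // Tell the runtime that the following code depends on the OS thread identity.
        Thread.BeginThreadAffinity();

        UIntPtr mask = new UIntPtr(1UL << coreIndex);

        // SetThreadAffinityMask returns the previous mask, or zero on failure.
        if (SetThreadAffinityMask(GetCurrentThread(), mask) == UIntPtr.Zero)
            throw new Win32Exception(Marshal.GetLastWin32Error());
    }
}
```

Calling `ThreadPinning.PinCurrentThreadToCore(processorID)` as the first statement of `ComputeMonteCarloPi` would pin each thread to a distinct core; passing `0` for every thread would pin them all to the same core. A matching `Thread.EndThreadAffinity()` call belongs at the end of the thread body.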

Similarly, running the computation on each thread multiple times (and timing them individually) also gives seemingly ad hoc durations. Thus, the issue doesn’t appear to be caused by per-processor warm-up times either.

What I did find to make a difference was pinning all 8 threads to the same processor. In this case, the overall execution was 25 s, which is only 1% slower than executing 8× the work on a single thread. Furthermore, the relative standard deviation also dropped to under 1%. Thus, the issue lies not in Mono's multithreading per se, but in its multiprocessing.

Does anyone know how to fix this performance issue?

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

static class Program
{
    // Number of random points sampled by each thread.
    static long limit = 1L << 26;

    // Per-thread hit counts and execution times, indexed by thread number.
    static long[] results;
    static TimeSpan[] timesTaken;

    internal static void Main(string[] args)
    {
        int processorCount = Environment.ProcessorCount;

        Console.WriteLine("Thread count: " + processorCount);
        Console.WriteLine("Number of points per thread: " + limit.ToString("N0"));

        Thread[] threads = new Thread[processorCount];
        results = new long[processorCount];
        timesTaken = new TimeSpan[processorCount];

        for (int i = 0; i < processorCount; ++i)
            threads[i] = new Thread(ComputeMonteCarloPi);

        Stopwatch stopwatch = Stopwatch.StartNew();

        for (int i = 0; i < processorCount; ++i)
            threads[i].Start(i);

        for (int i = 0; i < processorCount; ++i)
            threads[i].Join();

        stopwatch.Stop();

        // The fraction of points inside the circle approximates pi / 4.
        double average = results.Average();
        double ratio = average / limit;
        double pi = ratio * 4;

        Console.WriteLine("Pi: " + pi);

        Console.WriteLine("Overall duration:   " + FormatTime(stopwatch.Elapsed));
        Console.WriteLine();

        for (int i = 0; i < processorCount; ++i)
            Console.WriteLine("Thread " + i.ToString().PadLeft(2, '0') + " duration: " + FormatTime(timesTaken[i]));

        Console.ReadKey();
    }

    static void ComputeMonteCarloPi(object o)
    {
        int processorID = (int)o;

        // Fixed seed: every thread draws the same sequence and does identical work.
        Random random = new Random(0);
        Stopwatch stopwatch = Stopwatch.StartNew();

        long hits = SamplePoints(random);

        stopwatch.Stop();

        timesTaken[processorID] = stopwatch.Elapsed;
        results[processorID] = hits;
    }

    // Counts how many random points in the unit square fall inside the inscribed circle.
    private static long SamplePoints(Random random)
    {
        long hits = 0;

        for (long i = 0; i < limit; ++i)
        {
            double x = random.NextDouble() - 0.5;
            double y = random.NextDouble() - 0.5;

            if (x * x + y * y <= 0.25)
                hits++;
        }

        return hits;
    }

    // Formats a duration as whole milliseconds, right-aligned.
    static string FormatTime(TimeSpan time, int padLeft = 7)
    {
        return time.TotalMilliseconds.ToString("N0").PadLeft(padLeft);
    }
}
```

Output on .NET:

```
Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14145541191101
Overall duration:     4,234

Thread 00 duration:   4,199
Thread 01 duration:   3,987
Thread 02 duration:   4,002
Thread 03 duration:   4,032
Thread 04 duration:   3,956
Thread 05 duration:   3,980
Thread 06 duration:   4,036
Thread 07 duration:   4,160
```

Output on Mono:

```
Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14139330387115
Overall duration:    17,890

Thread 00 duration:  10,023
Thread 01 duration:  13,203
Thread 02 duration:  14,776
Thread 03 duration:  15,564
Thread 04 duration:  17,888
Thread 05 duration:  16,776
Thread 06 duration:  16,050
Thread 07 duration:   5,561
```

Output on Mono, with all threads pinned to same processor:

```
Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14139330387115
Overall duration:    25,260

Thread 00 duration:  24,704
Thread 01 duration:  25,191
Thread 02 duration:  24,689
Thread 03 duration:  24,697
Thread 04 duration:  24,716
Thread 05 duration:  24,725
Thread 06 duration:  24,707
Thread 07 duration:  24,720
```

Output on Mono, single thread:

```
Thread count: 1
Number of points per thread: 536,870,912
Pi: 3.14153660088778
Overall duration:    25,090
```

Douglas

1 Answer


Running with mono --gc=sgen fixed it for me, as expected (using Mono 3.0.10).

The underlying issue is that thread-local allocation for the Boehm garbage collector requires some special tuning when used in conjunction with typed allocation or large blocks. This is not only somewhat non-trivial but also has some downsides: you either make marking more complicated/expensive or you require one freelist per thread and type (well, per memory layout).

Thus, by default, the Boehm GC only supports thread-local allocation for completely pointer-free memory areas or for areas where every word can be a pointer, up to a maximum of 256 bytes or so.

But without thread-local allocation, each allocation acquires a global lock, which becomes a bottleneck.
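One way to check whether allocation is the bottleneck on a given runtime is to time an allocation-heavy loop on one thread and then on all cores. The following is a hypothetical probe, not part of the original program: the `AllocationProbe` and `Node` names and the allocation count are arbitrary, and whether `Node` hits the same allocation path as the objects in the question is an assumption.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class AllocationProbe
{
    // Small object mixing a pointer field and a non-pointer field (hypothetical type).
    sealed class Node
    {
        public long Value;
        public Node Next;
    }

    const int AllocationsPerThread = 10000000;

    static void AllocateInLoop()
    {
        Node last = null;
        for (int i = 0; i < AllocationsPerThread; ++i)
            last = new Node { Value = i };   // short-lived heap allocation
        GC.KeepAlive(last);
    }

    // Runs the same allocation loop on threadCount threads and returns the wall-clock time.
    static TimeSpan Run(int threadCount)
    {
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; ++i)
            threads[i] = new Thread(AllocateInLoop);

        Stopwatch stopwatch = Stopwatch.StartNew();
        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();
        return stopwatch.Elapsed;
    }

    static void Main()
    {
        int cores = Environment.ProcessorCount;
        Console.WriteLine("1 thread:   " + Run(1).TotalMilliseconds.ToString("N0") + " ms");
        Console.WriteLine(cores + " threads:  " + Run(cores).TotalMilliseconds.ToString("N0") + " ms");
    }
}
```

If allocations are serialized behind a single lock, the multi-threaded run would be expected to take much longer than the single-threaded one even though each thread does the same amount of work. If the two times are close, allocation is probably not the limiting factor.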

The SGen garbage collector is custom-written for Mono, specifically designed to work fast in a multi-threaded system, and does not have these problems.

Reimer Behrends
  • Thanks! I'll try it out and report back. – Douglas Jul 09 '13 at 18:17
  • Using `--gc=sgen` on Mono 2.10.9 did not make a difference. I would have been quite surprised if this were really a GC issue, since the compute-intensive loop purposely does not instantiate *any* objects (and "empty" GC collections should not result in 10+ seconds of slow-down). I'll try to get my hands on Mono 3. – Douglas Jul 09 '13 at 19:01
  • I've installed [Mono 3.0.10](http://stackoverflow.com/a/17556335/1149773), and `mono-sgen.exe` does seem to fix the performance issue (although Mono crashes with "Windows systems haven't been ported to support mono_thread_state_init_from_handle" at the program's end). For some reason, `mono.exe --gc=sgen` doesn't work (or crash). – Douglas Jul 09 '13 at 19:44
  • There's a significant and reproducible difference for me between `mono` and `mono --gc=sgen` in 3.0.10. Note that while the loop may appear to not have any allocations, some may occur implicitly in `random.NextDouble()`, e.g. through boxing -- if you replace that call by a dummy routine that does some work and returns a constant, the speed difference goes away, too. – Reimer Behrends Jul 09 '13 at 19:58
  • On a slightly unrelated note, I noticed that each of your processing threads is using the same random number seed (0). Each thread will therefore operate on the same stream of pseudo-random numbers, meaning your final answer will be no different from running a single thread. – RogerN Jul 09 '13 at 20:14
  • @ReimerBehrends: The Windows distribution for Mono 3.0.10 still seems to be buggy (and not officially supported), so I assume `mono.exe` misses the `--gc=sgen` argument (which should otherwise have the same effect as running `mono-sgen.exe`). The internal implementation of `Random.NextDouble` does not appear to contain any object instantiations (see [`Random.cs`](https://github.com/mono/mono/blob/master/mcs/class/corlib/System/Random.cs)); and replacing the `Random` class with a custom integer-based one didn't affect the performance. – Douglas Jul 09 '13 at 20:24
  • @ReimerBehrends: Anyway, thanks for the tests. I just need to verify that it works on Linux tomorrow; if it does, I'll accept this answer. – Douglas Jul 09 '13 at 20:25
  • @RogerN: Yes, I intentionally used a fixed seed to eliminate the remote possibility that the performance discrepancies were caused by differences in the pseudo-random number sequences. I'll be using an `Interlocked.Increment`ed seed for my proper implementation. – Douglas Jul 09 '13 at 20:27
  • @Douglas - yes, this is a bit puzzling. I've just filed a bug with Xamarin. I can't help you with the Windows issue, I'm afraid, as I'm using OS X and Linux myself. On OS X/Linux, `mono --gc=sgen` does an exec() of `mono-sgen`, which is why it may not work on Windows, but that's a guess. – Reimer Behrends Jul 09 '13 at 20:44
  • @ReimerBehrends: Just verified the fix on Linux, and it works like you said (with the `--gc=sgen` flag). Thanks a lot, this solved a huge issue for me :-) Please post back if you hear anything from Xamarin about it. – Douglas Jul 10 '13 at 14:32
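
The `Interlocked.Increment`ed seed mentioned in the comments could look roughly like this. It is a minimal sketch rather than the poster's actual implementation; the `SeedSource` class, the `nextSeed` field, and the `CreateRandom` method are hypothetical names.

```csharp
using System;
using System.Threading;

static class SeedSource
{
    // Shared counter; each call hands out the next integer as a seed.
    static int nextSeed = 0;

    // Safe to call concurrently: Interlocked.Increment is atomic,
    // so no two callers receive the same seed.
    public static Random CreateRandom()
    {
        int seed = Interlocked.Increment(ref nextSeed);
        return new Random(seed);
    }
}
```

Each worker thread would then call `SeedSource.CreateRandom()` instead of `new Random(0)`, so that the threads sample different pseudo-random sequences.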