3

I iterate over a BitArray, setting every second bit.

Now I'd like to speed this up by splitting the work across two threads. For some weird reason, though, each thread takes 64% MORE time to do HALF the work, and I wonder why that is.

Could this be due to some kind of CPU caching effect? How do I do this properly?

I have also tried 8 threads with lambda expressions before, but it was always around ~1400 ms, whereas single-threaded I consistently got 850 ms. Also, when I let a single worker thread do all the work, it took 830 ms. I just don't understand it; does anyone here know the cause?

Code:

    using System;
    using System.Collections;
    using System.Diagnostics;
    using System.Threading;

    class Program
    {
        static int count = 0x10000000;   // 2^28 bits
        static int half = count / 2;
        static BitArray bitArray = new BitArray(count);

        static unsafe void Main(string[] args)
        {
            Stopwatch sw = Stopwatch.StartNew();

#if SINGLE
            for (int i = 0; i < bitArray.Count; i += 2)
                bitArray.Set(i, true);
#else
            Thread thread1 = new Thread(Thread1);
            Thread thread2 = new Thread(Thread2);
            thread1.Start();
            thread2.Start();
            thread1.Join();
            thread2.Join();
#endif
            sw.Stop();

            Console.WriteLine(sw.ElapsedMilliseconds);
            Console.ReadLine();
        }

        static void Thread1()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < half; i += 2)
                bitArray.Set(i, true);
            sw.Stop();
            Console.WriteLine("Thread1: {0}", sw.ElapsedMilliseconds);
        }

        static void Thread2()
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = half; i < count; i += 2)
                bitArray.Set(i, true);
            sw.Stop();
            Console.WriteLine("Thread2: {0}", sw.ElapsedMilliseconds);
        }
    }
Patrick Desjardins
hl3mukkel
  • How many CPU cores do you have? – Jesse C. Slicer Aug 08 '13 at 00:35
  • Have you done any sort of profiling yet? Also, why is your code unsafe? – Tombatron Aug 08 '13 at 00:37
  • I have 4 cores with 8 threads, a Core i7 2600K @ 4.0 GHz – hl3mukkel Aug 08 '13 at 00:37
  • @Tombatron oh, that was just left over; you can remove it as it's not necessary here. I am trying to do the profiling with the Stopwatches; I think that's fairly accurate, isn't it? – hl3mukkel Aug 08 '13 at 00:38
  • You don't seem to be concerned much by the fact that BitArray is not a thread-safe class? – jods Aug 08 '13 at 00:40
  • I know it's not; however, it uses ints under the hood, so as long as I pick two regions that don't intersect on a 32-bit boundary, I don't see how this could be a problem – hl3mukkel Aug 08 '13 at 00:41
  • How long does each thread take? – Yochai Timmer Aug 08 '13 at 00:44
  • About 1400 ms too; however, if I leave all the work to one thread it takes ~830 ms (850 ms in the main thread) – hl3mukkel Aug 08 '13 at 00:45
  • I think this is the answer for you: http://stackoverflow.com/questions/2902264/threading-vs-single-thread – the accepted answer is a great analogy by Eric Lippert. Also, this might interest you: http://msdn.microsoft.com/en-us/library/ms810437.aspx – terrybozzio Aug 08 '13 at 00:45
  • @hl3mukkel as I explained in my answer, the fact that you think there is no intersection doesn't mean it's true, as you don't know how the class is implemented. A BitArray is also optimized, so that 32 bits are stored in a single word. If your half is in the middle of a word, your threads will clash. – Giulio Franco Aug 08 '13 at 00:49
  • @Giulio Franco I do know how the class is implemented; I used dotPeek to decompile it and saw how it works. 0x10000000 divided by 32 gives me 0x800000 words; half that is 0x400000, without any remainder. My for loops don't intersect, because the second thread starts at the exclusive upper bound of the first one, and the underlying ints don't intersect either, so that's not a problem. Corruption would be an issue if they did overlap, but in this example they do not (see the quick check after these comments). – hl3mukkel Aug 08 '13 at 00:53
  • @terrybozzio Thanks, reading the article at the moment, seems interesting – hl3mukkel Aug 08 '13 at 00:53
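
To verify the word-boundary argument from the comment above, here is a quick, hypothetical check (it assumes, as the decompiled source suggests, that BitArray packs 32 bits per int):

    using System;

    class WordBoundaryCheck
    {
        static void Main()
        {
            int count = 0x10000000;   // 2^28 bits in the BitArray
            int half = count / 2;     // bit index where the second thread starts

            Console.WriteLine((count / 32).ToString("X"));  // 800000 -> total number of 32-bit words
            Console.WriteLine(half % 32);                   // 0 -> the split lands exactly on a word boundary
            Console.WriteLine((half / 32).ToString("X"));   // 400000 -> first word index touched by thread 2
        }
    }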

3 Answers

9

BitArray is not a thread-safe class, so you should not use it like that. In fact, besides correctness, this is most probably the cause of the slowdown. Here's why:

If you look at the source code of BitArray, it contains an int version field, which is updated by every mutating operation, notably Set(), which you call.

This means that both threads continuously update the same memory location, which is a huge performance killer because the cores have to communicate and synchronize every time they access it. Under these conditions it makes perfect sense that the multithreaded solution performs worse than the single-threaded one.
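
For illustration, here is a minimal sketch (not the BCL implementation; the name SimpleBitSet and its layout are made up) of a bit store with no shared version field, where concurrent writers are safe as long as they never touch the same 32-bit word:

    // A minimal sketch, not the real BitArray: there is no "version" field,
    // so two threads writing to disjoint ranges of words share no state at all.
    class SimpleBitSet
    {
        private readonly int[] words;

        public SimpleBitSet(int bitCount)
        {
            words = new int[(bitCount + 31) / 32];
        }

        // Safe to call from multiple threads only if the callers never
        // write to the same 32-bit word (i.e. their index ranges are
        // word-aligned and disjoint).
        public void Set(int index, bool value)
        {
            int word = index >> 5;           // index / 32
            int mask = 1 << (index & 31);    // bit position within the word
            if (value)
                words[word] |= mask;
            else
                words[word] &= ~mask;
        }
    }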

jods
  • Thanks for your response! That makes sense. Do you think recreating the class without the int version field would give better performance? – hl3mukkel Aug 08 '13 at 00:59
  • Removing the shared `version` field is clearly a must if you want an efficient parallel algorithm. But performance is difficult to predict, maybe there will be other bottlenecks (e.g. false sharing), you have to test and benchmark. – jods Aug 08 '13 at 01:02
  • ok, cheers! I assume I'll just leave it single threaded then, thanks for your effort! – hl3mukkel Aug 08 '13 at 01:03
  • @hl3mukkel by implementing my own BitArray, I got ~1800ms on single thread and ~1300ms on 2 threads. With the BitArray implementation provided by Mono (I can't use Windows right now), I got ~2300ms on single thread and ~3400ms on multithread. (my implementation is really really poor, that's why it's faster even in single thread) – Giulio Franco Aug 08 '13 at 15:39
  • I just copied it and tried it again, and yeah, the version variable seems to be the bottleneck. Sorry if I'm being annoying, but regarding "because all cores have to communicate and synchronize": doesn't the CPU only do that when it's told to, like with a lock? I'm not questioning your answer, but I don't get why it synchronizes without being told to, like with Monitor.Lock, and why that hurts performance. Wouldn't it rather cause a "race condition"? I am working on a multithreaded version of it now – hl3mukkel Aug 09 '13 at 16:57
  • @hl3mukkel: Main memory is a resource shared by all cores; what you write there is visible to the other cores. If you had to explicitly say so, that would be software transactional memory (STM). Because memory access is the main bottleneck nowadays, the CPU hides its latency behind several levels of caches. If you modify a memory location that is inside the cache of another core, that cache line has to be invalidated. This requires communication between the caches (not free) and will result in a cache miss later on (expensive). – jods Aug 14 '13 at 12:20
  • @hl3mukkel: additionally, such performance problems may even arise when you don't share a single memory location, but several memory locations that happen to share the same cache line. This is called 'false sharing'; you can look it up (there's a small sketch of it right after these comments). – jods Aug 14 '13 at 12:22
  • ah! I understand, thank you very much! – hl3mukkel Aug 16 '13 at 19:49
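
To illustrate the false-sharing point from the comments above, here is a hedged sketch (the slot indices and the assumed 64-byte cache-line size are illustrative, not measured values): two threads incrementing adjacent array elements fight over one cache line, while elements far enough apart do not.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class FalseSharingDemo
    {
        const int Iterations = 100000000;

        // Adjacent longs live on the same cache line; slots 0 and 16
        // are 128 bytes apart, i.e. on different lines on typical CPUs.
        static long[] counters = new long[32];

        static void Run(int slotA, int slotB)
        {
            counters[slotA] = counters[slotB] = 0;
            Thread t1 = new Thread(() => { for (int i = 0; i < Iterations; i++) counters[slotA]++; });
            Thread t2 = new Thread(() => { for (int i = 0; i < Iterations; i++) counters[slotB]++; });

            Stopwatch sw = Stopwatch.StartNew();
            t1.Start(); t2.Start();
            t1.Join(); t2.Join();
            Console.WriteLine("slots {0} and {1}: {2} ms", slotA, slotB, sw.ElapsedMilliseconds);
        }

        static void Main()
        {
            Run(0, 1);    // same cache line: slow due to false sharing
            Run(0, 16);   // different cache lines: typically much faster
        }
    }
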
2

Because threading isn't as easy as it seems.

First of all, as stated by the documentation, BitArray is NOT thread-safe. This means it can behave unpredictably when used concurrently by multiple threads.

As for the performance penalty, it's probably due to increased bus traffic caused by your two threads concurrently modifying the same memory locations, continuously invalidating each other's caches.

Even though you think your threads are not modifying the same locations, that might not be true. For instance, BitArray has a Count property. It's very likely that each time you set a bit, the thread modifies a counter variable backing the Count property. Such concurrent modification is dangerous because of race conditions and stale values, and it can increase the bus traffic, as described above.

The issue is that a CPU core runs at 2-3 GHz, while RAM and the memory bus run at roughly 1 GHz. RAM is much slower: a main-memory access costs on the order of 100 CPU cycles. If you defeat the CPU's caching mechanisms, performance will obviously drop.

EDIT: not to mention that creating and joining threads involves significant overhead. If your whole job is a one-shot 830 ms run, it's unlikely that you can obtain significant improvements with multithreading. You should also try to get rid of the Stopwatches inside the threads, because they add overhead too.
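
To illustrate the thread-creation point, here is a hedged sketch (it is not the original benchmark: it fills a plain int[] instead of a BitArray, and the chunking is an assumption) that reuses thread-pool threads via Task.Run and keeps the Stopwatch outside the workers:

    using System;
    using System.Diagnostics;
    using System.Threading.Tasks;

    class PooledBenchmark
    {
        const int Count = 0x10000000;   // 2^28 bits

        static void Main()
        {
            // Each worker fills its own range of words, so no state is shared.
            int[] words = new int[Count / 32];
            Stopwatch sw = Stopwatch.StartNew();

            // Task.Run schedules work on existing thread-pool threads,
            // avoiding the cost of creating and destroying Thread objects.
            Task first = Task.Run(() => FillEverySecondBit(words, 0, words.Length / 2));
            Task second = Task.Run(() => FillEverySecondBit(words, words.Length / 2, words.Length));
            Task.WaitAll(first, second);

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }

        // Writes the pattern 0101...01 into every word of the range,
        // which sets bits 0, 2, 4, ... of the overall bit set.
        static void FillEverySecondBit(int[] words, int start, int end)
        {
            const int evenBits = 0x55555555;
            for (int i = start; i < end; i++)
                words[i] = evenBits;
        }
    }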

Giulio Franco
  • You're close: `Count` is the length of the bit array; it doesn't change when calling `Set`. On the other hand, there is a `version` private field, whose purpose is to break enumerators if the content of the bit array has changed. – jods Aug 08 '13 at 00:50
  • Thanks for your response, but "trying to concurrently modify the same memory locations" isn't exactly true, as I split the work into 0 to (exclusive) half and half to count, and as said previously Count is not affected; nothing is written other than the internal int[] m_array. I agree that caching might be a concern, but in the Sandy Bridge series each core has its own L1 & L2 cache and a shared L3 cache. Maybe that's the cause, I'm not sure – hl3mukkel Aug 08 '13 at 00:58
  • Also, if Stopwatches are not recommended, what should be used instead? The included Performance Analyser of Visual Studio? – hl3mukkel Aug 08 '13 at 01:02
  • @hl3mukkel it's not that Stopwatch is not recommended. It's that creating an object and calling two methods on it is itself a slowdown that may significantly affect your relative results when we are talking about running times of less than a second. The performance analyser would be more objective between the two versions. – Giulio Franco Aug 08 '13 at 01:05
  • ok, that makes sense; however, the input size is really large (2^27), so I thought small constants like that wouldn't matter too much. You're right of course, though, as that costs performance too – hl3mukkel Aug 08 '13 at 01:08
1

I modified your code so that the test runs 10 times and reports the results. Using your code, I see similar timings for the single-threaded vs. multithreaded tests (each thread takes around 1200 ms).

However, as others have said, using a single BitArray from multiple threads is not guaranteed to be free of contention between the threads.

This is most simply demonstrated by giving each thread its own BitArray instead of a shared static one (a sketch of this variant follows the timings below). With this approach, I typically see each thread taking around 450 ms, although occasionally I still see larger times:

Thread2: 415
Thread1: 420
447
Thread2: 414
Thread1: 420
496
Thread1: 1185
Thread2: 1198
1249
Thread1: 417
Thread2: 421
455
Thread1: 420
Thread2: 415
455
Thread2: 413
Thread1: 417
491
Thread2: 413
Thread1: 417
508
Thread2: 417
Thread1: 441
526
Thread1: 420
Thread2: 415
465
Thread1: 940
Thread2: 1005
1087
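
A sketch of that per-thread variant (not my exact test harness; the class and method names below are illustrative):

    using System;
    using System.Collections;
    using System.Diagnostics;
    using System.Threading;

    class PerThreadBitArrays
    {
        static int count = 0x10000000;
        static int half = count / 2;

        // Each thread gets its own BitArray, so nothing is shared between them.
        static BitArray bits1 = new BitArray(half);
        static BitArray bits2 = new BitArray(half);

        static void Main()
        {
            Stopwatch sw = Stopwatch.StartNew();

            Thread thread1 = new Thread(() => FillEverySecondBit(bits1));
            Thread thread2 = new Thread(() => FillEverySecondBit(bits2));
            thread1.Start();
            thread2.Start();
            thread1.Join();
            thread2.Join();

            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static void FillEverySecondBit(BitArray bits)
        {
            for (int i = 0; i < bits.Count; i += 2)
                bits.Set(i, true);
        }
    }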

Ultimately I think what this is showing is that:

  • Despite the code design, there are still contention effects in the BitArray between the threads
  • Even with individual bit arrays per thread, there are still "random" effects in the timing, which shows that with microbenchmarking like this you are always effectively benchmarking a lot more than just the code you've written: GC, the CPU cache, context switching, core hopping, stopwatch inaccuracies, and so on.
  • If the real aim of the code you're trying to write is to fill bit arrays as quickly as possible, you'll probably want a closer-to-the-wire, more manual approach, possibly in another language.
Niall Connaughton
  • I agree! I am working on a multithreaded version now; the int version field seems to cause the problem, as the cores keep trying to keep it in sync – hl3mukkel Aug 09 '13 at 16:58