Using ThreadStatic to replace expensive locals -- good idea?

Question

Update: as I should have expected, the community's sound advice in response to this question was to "measure it and see." chibacity posted an answer with some really nice tests that did this for me; meanwhile, I wrote a test of my own; and the performance difference I saw was actually so huge that I felt compelled to write a blog post about it.

However, I should also acknowledge Hans's explanation that the ThreadStatic attribute is indeed not free and in fact relies on a CLR helper method to work its magic. This makes it far from obvious whether it would be an appropriate optimization to apply in any arbitrary case.

The good news for me is that, in my case, it seems to have made a big improvement.

I have a method which (among many other things) instantiates some medium-size arrays (~50 elements) for a few local variables.

After some profiling I've identified this method as something of a performance bottleneck. It isn't that the method takes an extremely long time to call; rather, it is simply called many times, very quickly (hundreds of thousands to millions of times in a session, which will be several hours). So even relatively small improvements to its performance should be worthwhile.

It occurred to me that instead of allocating a new array on each call, I could use fields marked [ThreadStatic]: whenever the method is called, it checks whether the field has been initialized on the current thread and, if not, initializes it. From then on, every call on the same thread has an array ready to go.

(The method initializes every element in the array itself, so having "stale" elements in the array should not be an issue.)
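
A minimal sketch of the pattern described above, for illustration only. The names `Worker`, `_scratch`, `BufferSize`, and `Process` are hypothetical and the loop body is a stand-in for the real work. The field has no initializer because an initializer on a [ThreadStatic] field runs only once, on the thread that executes the class constructor.

```csharp
using System;

public static class Worker
{
    private const int BufferSize = 50;

    // Hypothetical scratch buffer: one instance per thread, null until first use.
    [ThreadStatic]
    private static int[] _scratch;

    public static int Process(int seed)
    {
        // Read the thread-static field once and keep the reference in a real local.
        int[] buffer = _scratch ?? (_scratch = new int[BufferSize]);

        // Every element is overwritten before it is read,
        // so stale data left by an earlier call is harmless.
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = seed + i;
        }

        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            sum += buffer[i];
        }

        return sum;
    }
}
```

Calling `Worker.Process` repeatedly on one thread allocates the array only on the first call; each additional thread pays for one allocation of its own.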

My question is simply this: does this seem like a good idea? Are there pitfalls to using the ThreadStatic attribute in this way (i.e., as a performance optimization to mitigate the cost of instantiating new objects for local variables) that I should know about? Is the performance of a ThreadStatic field itself perhaps not great; e.g., is there a lot of extra "stuff" happening in the background, with its own set of costs, to make this feature possible?

It's also quite plausible to me that I'm wrong to even try to optimize something as cheap (?) as a 50-element array—and if that's so, definitely let me know—but the general question still holds.

Try it and measure. I suppose that accessing a thread-local item is not free either, though maybe cheaper than reallocation. — 9000, Feb 01 '11 at 16:59
If you're able to use .NET4 then don't forget [`ThreadLocal`](http://msdn.microsoft.com/en-us/library/dd642243.aspx) too. The jury's still out on whether or not it outperforms `ThreadStatic`, but it's a bit easier to use (and to get right). I'd recommend you consider `ThreadLocal` and include it in your benchmarks. — LukeH, Feb 09 '11 at 01:16
For what it's worth, for a serializer I made I found using [ThreadStatic] `StringBuilders` to be significantly faster than creating a new buffer every time. — JulianR, Apr 30 '13 at 02:01
@JulianR: Nice! Yeah, that actually seems like a perfect case—since after a few recycles I would think your pre-allocated per-thread `StringBuilder` objects would have grown to a point where they'd probably not need to resize anymore. Big GC win! — Dan Tao, Apr 30 '13 at 16:23
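
For comparison, here is a sketch of the same idea using `ThreadLocal<T>` as LukeH suggests, assuming .NET 4 or later. The class and member names are hypothetical. It says nothing about relative speed; that still has to be measured.

```csharp
using System;
using System.Threading;

public static class WorkerWithThreadLocal
{
    private const int BufferSize = 50;

    // The factory delegate runs once per thread, the first time that thread reads Value.
    private static readonly ThreadLocal<int[]> Scratch =
        new ThreadLocal<int[]>(() => new int[BufferSize]);

    public static int Process(int seed)
    {
        // Fetch the per-thread array once and work through a local reference.
        int[] buffer = Scratch.Value;

        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = seed + i;
        }

        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            sum += buffer[i];
        }

        return sum;
    }
}
```

The main convenience is that the null check disappears: the lazy per-thread initialization lives in the factory passed to the constructor.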

score 9 · Answer 1 · answered Feb 01 '11 at 17:22

[ThreadStatic] is no free lunch. Every access to the variable needs to go through a helper function in the CLR (JIT_GetThreadFieldAddr_Primitive/Objref) instead of being compiled inline by the jitter. It also isn't a true substitute for a local variable: recursion is going to bite. You really have to profile this yourself; guesstimating perf with that much CLR code in the loop isn't feasible.

Hans Passant

You only need to pay that cost once per function call (to grab a ref to the array and store it in a real local) – Yuliy Feb 01 '11 at 17:25
No, that's no free lunch either. The object reference still goes through a stub. – Hans Passant Feb 01 '11 at 17:32
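
The recursion warning in the answer above can be illustrated with a small sketch. All names here are hypothetical. A method that keeps its working data in a [ThreadStatic] array shares that array with any nested call on the same thread, whereas a freshly allocated local does not.

```csharp
using System;

public static class ReentrancyDemo
{
    // Shared by every call on the same thread, including nested ones.
    [ThreadStatic]
    private static int[] _shared;

    // Fills the thread-static buffer with `depth`, recurses, then reports
    // whether its own data is still intact after the nested call returned.
    public static bool SurvivesWithThreadStatic(int depth)
    {
        int[] buffer = _shared ?? (_shared = new int[4]);
        Fill(buffer, depth);

        if (depth > 0)
        {
            SurvivesWithThreadStatic(depth - 1); // overwrites the same array
        }

        return buffer[0] == depth;
    }

    // Same logic, but each call owns a freshly allocated array.
    public static bool SurvivesWithLocal(int depth)
    {
        int[] buffer = new int[4];
        Fill(buffer, depth);

        if (depth > 0)
        {
            SurvivesWithLocal(depth - 1); // works on its own array
        }

        return buffer[0] == depth;
    }

    private static void Fill(int[] buffer, int value)
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = value;
        }
    }
}
```

With `depth` set to 1, the thread-static version finds its data clobbered by the inner call, while the local version does not. So the optimization is only safe when the method is never re-entered on the same thread while the buffer is in use.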

Tim Lloyd · Accepted Answer · 2011-02-01T18:41:02.127

I have carried out a simple benchmark and ThreadStatic performs better for the simple parameters described in the question.

As with many algorithms that run a high number of iterations, I suspect it is a straightforward case of GC overhead killing the version that allocates new arrays.

Update

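The results below add two things to the original test, which only copied the array reference into a local: tests that iterate over the array, to model minimal use of the array reference, and a test that reads the ThreadStatic field directly inside the loop instead of going through the local copy: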

Iterations : 10,000,000

Local ArrayRef          (- array iteration) : 330.17ms
Local ArrayRef          (- array iteration) : 327.03ms
Local ArrayRef          (- array iteration) : 1382.86ms
Local ArrayRef          (- array iteration) : 1425.45ms
Local ArrayRef          (- array iteration) : 1434.22ms
TS    CopyArrayRefLocal (- array iteration) : 107.64ms
TS    CopyArrayRefLocal (- array iteration) : 92.17ms
TS    CopyArrayRefLocal (- array iteration) : 92.42ms
TS    CopyArrayRefLocal (- array iteration) : 92.07ms
TS    CopyArrayRefLocal (- array iteration) : 92.10ms
Local ArrayRef          (+ array iteration) : 1740.51ms
Local ArrayRef          (+ array iteration) : 1647.26ms
Local ArrayRef          (+ array iteration) : 1639.80ms
Local ArrayRef          (+ array iteration) : 1639.10ms
Local ArrayRef          (+ array iteration) : 1646.56ms
TS    CopyArrayRefLocal (+ array iteration) : 368.03ms
TS    CopyArrayRefLocal (+ array iteration) : 367.19ms
TS    CopyArrayRefLocal (+ array iteration) : 367.22ms
TS    CopyArrayRefLocal (+ array iteration) : 368.20ms
TS    CopyArrayRefLocal (+ array iteration) : 367.37ms
TS    TSArrayRef        (+ array iteration) : 360.45ms
TS    TSArrayRef        (+ array iteration) : 359.97ms
TS    TSArrayRef        (+ array iteration) : 360.48ms
TS    TSArrayRef        (+ array iteration) : 360.03ms
TS    TSArrayRef        (+ array iteration) : 359.99ms

Code:

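// Requires System, System.Diagnostics and System.Diagnostics.CodeAnalysis,
// plus a test framework that supplies [Test] (NUnit is assumed here).

// One array per thread; null until first use on each thread.
[ThreadStatic]
private static int[] _array;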

[Test]
public object measure_thread_static_performance()
{
    const int TestIterations = 5;
    const int Iterations = (10 * 1000 * 1000);
    const int ArraySize = 50;

    Action<string, Action> time = (name, test) =>
    {
        for (int i = 0; i < TestIterations; i++)
        {
            TimeSpan elapsed = TimeTest(test, Iterations);
            Console.WriteLine("{0} : {1:F2}ms", name, elapsed.TotalMilliseconds);
        }
    };

    int[] array = null;
    int j = 0;

    Action test1 = () =>
    {
        array = new int[ArraySize];
    };

    Action test2 = () =>
    {
        array = _array ?? (_array = new int[ArraySize]);
    };

    Action test3 = () =>
    {
        array = new int[ArraySize];

        for (int i = 0; i < ArraySize; i++)
        {
            j = array[i];
        }
    };

    Action test4 = () =>
    {
        array = _array ?? (_array = new int[ArraySize]);

        for (int i = 0; i < ArraySize; i++)
        {
            j = array[i];
        }
    };

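    // Unlike test4, the loop below reads the [ThreadStatic] field on every
    // iteration instead of going through the local copy.
    Action test5 = () =>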
    {
        array = _array ?? (_array = new int[ArraySize]);

        for (int i = 0; i < ArraySize; i++)
        {
            j = _array[i];
        }
    };

    Console.WriteLine("Iterations : {0:0,0}\r\n", Iterations);
    time("Local ArrayRef          (- array iteration)", test1);
    time("TS    CopyArrayRefLocal (- array iteration)", test2);
    time("Local ArrayRef          (+ array iteration)", test3);
    time("TS    CopyArrayRefLocal (+ array iteration)", test4);
    time("TS    TSArrayRef        (+ array iteration)", test5);

    Console.WriteLine(j);

    return array;
}

[SuppressMessage("Microsoft.Reliability", "CA2001:AvoidCallingProblematicMethods", MessageId = "System.GC.Collect")]
private static TimeSpan TimeTest(Action action, int iterations)
{
    Action gc = () =>
    {
        GC.Collect();
        GC.WaitForFullGCComplete();
    };

    Action empty = () => { };

    Stopwatch stopwatch1 = Stopwatch.StartNew();

    for (int j = 0; j < iterations; j++)
    {
        empty();
    }

    TimeSpan loopElapsed = stopwatch1.Elapsed;

    gc();
    action(); //JIT
    action(); //Optimize

    Stopwatch stopwatch2 = Stopwatch.StartNew();

    for (int j = 0; j < iterations; j++) action();

    gc();

    TimeSpan testElapsed = stopwatch2.Elapsed;

    return (testElapsed - loopElapsed);
}

There probably ought to be some use of the arrays as well to make this a realistic comparison. — 500 - Internal Server Error, Feb 01 '11 at 17:43
I've yet to use [ThreadStatic] myself, so I'm no expert, but its equivalent in unmanaged code has overhead for every access. Perhaps that can be mitigated here by copying the [ThreadStatic] to a local before use, but it would be interesting to see some numbers. — 500 - Internal Server Error, Feb 01 '11 at 17:47
@500 I'm not sure if your point was a suggestion or an observation re. "Perhaps that can be mitigated here by copying the [ThreadStatic] to a local before use"? I was already doing that as it felt like a sensible thing to do having used TLS before ;) — Tim Lloyd, Feb 01 '11 at 18:08
I just tested this on .NET 3.5; the results are significantly worse for ThreadStatic. While the TS-CopyArrayRefLocal methods are still about 20%-30% faster than the local ones, the TSArrayRef test is almost 100 times slower! So on .NET 3.5 this optimization will not gain you much, and if implemented wrong it could hurt performance a lot. — HugoRune, Aug 14 '12 at 11:25

score 2 · Answer 3 · answered Feb 01 '11 at 17:00

From results like this, ThreadStatic looks pretty fast. I'm not sure that anybody has a specific answer as to whether it's faster than reallocating a 50-element array, though. That's the kind of thing you'll have to benchmark yourself. :)

I'm somewhat torn on whether it's a "good idea" or not. So long as all the implementation details are kept inside the class it's not necessarily a bad idea (you really don't want the caller to have to worry about it), but unless benchmarks showed a performance gain from this method I would stick to simply allocating the array each time, because it makes the code simpler and easier to read. As the more complicated of the two solutions, I'd need to see some benefit from the complexity before choosing this one.

There is a hidden cost which is paid later: the "GC". — TakeMeAsAGuest, Feb 08 '19 at 19:07
