I'm working on a multithreaded implementation of a library. One module of this library contains some global variables that are used very often during program execution. To make access to those variables safer, I declared them with the thread-local storage (TLS) keyword __declspec(thread).

Here is the call to the library's external function, which uses the module with the global variables:

for (i = 0; i < n_cores; i++)
    hth[i] = (HANDLE)_beginthread((void(*)(void*))MT_Interface_DimenMultiCells, 0, (void*)&inputSet[i]);

This way, I assume that all the variables used in the library are duplicated for each thread.

When I run the program on an 8-core processor, the time required to complete the operation never drops below 1/3 of the time needed by the single-threaded implementation.

I know that it is impossible to reach 1/8 of the time, but I thought that at least 1/6 was reachable.

The question is: are those __declspec(thread) variables the cause of such poor performance?

Adrian Mole
Beppe
  • Are you still on StackOverflow? If so, perhaps you could let us know what your conclusion was: Did you measure the performance with and without thread local variables? If so, what difference did it make? Or did something else solve the problem? – PJTraill May 22 '15 at 12:27

2 Answers

If you declare them as __declspec(thread) where they were previously global, then you have changed the meaning of the program, as well as its performance characteristics.

When the variable was a global, there was a single copy that every thread referred to. Once it is thread local, each thread has its own copy of the variable, and changes to that copy are visible only in that thread.
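
A minimal sketch of that difference, using C++11 `thread_local` as a portable stand-in for `__declspec(thread)`. The names `shared_counter`, `private_counter`, `worker` and `run_workers` are made up for illustration and have nothing to do with the library in the question; the shared counter is atomic only so that the sketch itself is free of data races.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

std::atomic<int> shared_counter{0};     // one copy, seen by every thread
thread_local int private_counter = 0;   // each thread gets its own copy

// Hypothetical worker: increments both counters n times, then reports
// the value of *this thread's* private counter through `out`.
void worker(int n, int* out) {
    for (int i = 0; i < n; ++i) {
        ++shared_counter;
        ++private_counter;
    }
    *out = private_counter;
}

// Starts n_threads workers, waits for them, and returns what each
// thread saw in its own private counter.
std::vector<int> run_workers(int n_threads, int n) {
    assert(n_threads > 0);
    std::vector<int> results(static_cast<std::size_t>(n_threads));
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t)
        threads.emplace_back(worker, n, &results[static_cast<std::size_t>(t)]);
    for (auto& th : threads)
        th.join();
    return results;
}
```

Starting from a zeroed `shared_counter`, `run_workers(4, 1000)` leaves the shared counter at 4000, whereas each thread reports 1000 for its private counter and the calling thread's own copy is still 0.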

Assuming that you really do want thread-local storage, it is true that reading and writing thread local variables is more expensive than accessing normal variables. Whenever you are faced with an operation that takes a long time to perform, the best solution is to stop doing it at all. In this case there are two obvious ways to do so:

  1. Pass the variable around as a parameter so that it resides on the stack. Accessing stack variables is quick.
  2. If you have functions that read and write this variable a lot, then take a copy of it at the start of the function (into a local variable), work on that local variable, and then on return, write it back to the thread local.

Of these options the former is usually to be preferred. Option 2 has the big weakness that it can't easily be applied if the function calls another function that uses this variable.

Option 1 basically amounts to not using global variables (thread locals are a form of global).
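
Both options can be sketched roughly as follows. This is only an illustration under assumptions: `tls_accumulator`, `Context` and the three `sum_via_*` functions are hypothetical names, and C++11 `thread_local` stands in for `__declspec(thread)`. An optimizing compiler may already hoist some of the thread-local access out of the loop, so whether the rewrite helps is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the library's thread-local state.
thread_local double tls_accumulator = 0.0;

// Baseline: touches the thread-local variable on every iteration.
void sum_via_tls(const double* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        tls_accumulator += data[i];
}

// Option 2: copy the thread-local into a local at the start,
// work on the local, and write it back once on return.
void sum_via_local_copy(const double* data, std::size_t n) {
    double acc = tls_accumulator;
    for (std::size_t i = 0; i < n; ++i)
        acc += data[i];
    tls_accumulator = acc;
}

// Option 1: no global at all. The caller owns the state
// (for example, one Context per thread) and passes it in.
struct Context {
    double accumulator = 0.0;
};

void sum_via_context(Context& ctx, const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i)
        ctx.accumulator += data[i];
}
```

With option 1, each thread's entry function would create its own `Context` on the stack and hand it down to every function that needs it, which also sidesteps the nested-call weakness of option 2.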

This all may be completely wide of the mark of course, because you have said so little about what your code is actually doing. If you want to solve a performance problem, you first have to identify where it is, and that means you need to measure.

David Heffernan
  • Thank you, I'm now pretty sure that the problem lies in the thread local variable access time. I will try to implement solution "1" and then test the performance again. – Beppe Feb 22 '11 at 11:19

And the answer is: you need to profile the application, and measure where the most time is being spent. If it turns out to be in functions that often reference the TLS data, then "maybe" could be the answer.

It's generally very hard to pick out the reasons for bad performance even in code you've written yourself: doing it remotely in a program described in two short paragraphs is even harder.

Profile, then optimize.
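
If no profiler is at hand, even a crude timing harness gives a first data point. The sketch below is an assumption-laden micro-benchmark, not a substitute for profiling the real library: `tls_total`, `sum_with_tls`, `sum_with_local` and `time_us` are hypothetical names, C++11 `thread_local` stands in for `__declspec(thread)`, and an optimizer may simplify either loop enough to make the numbers meaningless.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Stand-in for thread-local library state.
thread_local std::uint64_t tls_total = 0;

// Accumulates 0..n-1 directly in the thread-local variable.
std::uint64_t sum_with_tls(std::uint64_t n) {
    tls_total = 0;
    for (std::uint64_t i = 0; i < n; ++i)
        tls_total += i;
    return tls_total;
}

// Same work on an ordinary stack variable.
std::uint64_t sum_with_local(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i)
        total += i;
    return total;
}

// Times a single call f(n) and returns the elapsed microseconds.
// The computed value is stored through `result` so the call is not
// discarded as unused.
template <typename F>
double time_us(F f, std::uint64_t n, std::uint64_t* result) {
    assert(result != nullptr);
    auto start = std::chrono::steady_clock::now();
    *result = f(n);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Calling `time_us(sum_with_tls, n, &a)` and `time_us(sum_with_local, n, &b)` for the same `n`, several times each and in the same build configuration as the real program, shows whether thread-local access is even measurable on the target machine before any code is restructured.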

unwind
  • Thank you for the answer. The TLS data are used in a very time-consuming set of operations. But from your words, I understand that this might not be the cause. I would post the code, but it's far too long, which is why I just tried to describe my problem briefly. – Beppe Feb 22 '11 at 10:59