2

I have a program with several threads, one thread will change a global when it exits itself and the other thread will repeatedly poll the global. No any protection on the globals. The program works fine on uni-processor. On dual core machine, it works for a while and then halt either on Sleep(0) or SuspendThread(). Would anyone be able to help me out on this?

The code would be like this:

Thread 1:

do something...
while(1)
{
.....
flag_thread1_running=false;
SuspendThread(GetCurrentThread());
continue;

}

Thread 2
flag_thread1_running=true;
ResumeThread(thread1);
.....do some other work here....
while(flag_thread1_running) Sleep(0);
....
Suma
  • 33,181
  • 16
  • 123
  • 191
  • 6
    Never use SuspendThread unless you're a debugger. Instead, use appropriate synchronization objects (events, mutexes, semaphores, etc) to control what threads are running. – 500 - Internal Server Error May 04 '10 at 00:25
  • Hi Larsen, any reason on this? Would appreciate if you can give me some hints – Shangping Guo May 04 '10 at 00:46
  • 1
    This doesn't make any sense. Where is the flag ever set to `true`? Why is there a bare `continue;` at the end of the `while` loop? If you produce a minimal *compilable* program that demonstrates the problem then you might get better help. – caf May 04 '10 at 00:54
  • 1
    thread 2 would be like this: flag_thread1_running=true; ResumeThread(thread1); .....do some other work here.... while(flag_thread1_running) Sleep(0); The continue statement let the thread start from the while(1) after it is resumed – Shangping Guo May 04 '10 at 01:13
  • 1
    Can you please rephrase the question ? I am not sure what you are trying to do. – Romain Hippeau May 04 '10 at 01:39
  • Well that's not what you have - you have `while (flag_thread1_running==false) Sleep(0);`, which is the opposite. – caf May 04 '10 at 02:19
  • Yes, you are right. My code should be: while(flag_thread1_running) Sleep(0); I am not posting the real code since they contains much more other work which may make the things more complicated – Shangping Guo May 04 '10 at 16:53

4 Answers4

14

The fact that you don't see any problem on a uniprocessor machine, but see problems on a multiproc machine is an artifact of the relatively large granularity of thread context switching on a uniprocessor machine. A thread will execute for N amount of time (milliseconds, nanoseconds, whatever) before the thread scheduler switches execution to a different thread. A lot of CPU instructions can execute in the typical thread timeslice. You can think of it as having a fairly large chunk of "free play" exclusive processor time during which you probably won't run into resource collisions because nothing else is executing on the processor.

When running on a multiproc machine, though, CPU instructions in two threads execute exactly at the same time. The size of the "free play" chunk of time is near zero.

To reproduce a resource contention issue between two threads, you need to get thread 1 to be accessing the resource and thread 2 to be accessing the resource at the same time, or very nearly the same time.

In the large-granularity thread switching that takes place on a uniprocessor machine, the chances that a thread switch will happen exactly in the right spot are slim, so the program may never exhibit a failure under normal use on a uniproc machine.

In a multiproc machine, the instructions are executing at the same time in the two threads, so the chances of thread 1 and thread 2 accessing the same resource at the same time are much, much greater - thousands of times more likely than the uniprocessor scenario.

I've seen it happen many times: an app that has been running fine for years on uniproc machines suddenly starts failing all over the place when executed on a new multiproc machine. The cause is a latent threading bug in the original code that simply never hit the right coincidence of timeslicing to repro on the uniproc machines.

When working with multithreaded code, it is absolutely imperitive to test the code on multiproc hardware. If you have thread collision issues in your code, they will quickly present themselves on a multiproc machine.

As others have noted, don't use SuspendThread() unless you are a debugger. Use mutexes or other synchronization objects to coordinate between threads.

dthorpe
  • 35,318
  • 5
  • 75
  • 119
3

Try using something more like WaitForSingleObjectEx instead of SuspendThread.

bta
  • 43,959
  • 6
  • 69
  • 99
  • Definitely. SuspendThread / ResumeThread used like this can lead to hard to debug race conditions, which hit you much more frequently on multi-CPU systems. Using proper synchronization primitives is a must. See also http://stackoverflow.com/questions/131818/is-putting-thread-on-hold-optimal – Suma May 04 '10 at 18:13
2

You are hitting a race condition. Thread 2 may execute flag_thread1_running=true; before thread 1 executes flag_thread1_running=false.

This is not likely to happen on single CPU, because with usual the scheduling quantum 10-20 ms you are not likely to hit the problem. It will happen there as well, but very rarely.

Using proper synchronization primitives is a must here. Instead of bool, use event. Instead of checking the bool in a loop, use WaitForSingleObject (or WaitForMultipleObjects for more elaborate stuff later).

It is possible to perform synchronization between threads using plain variables, but it is rarely a good idea and it is quite hard to do it right - cf. How can I write a lock free structure?. It is definitely not a good idea to perform schedulling using Sleep, Suspend or Resume.

Community
  • 1
  • 1
Suma
  • 33,181
  • 16
  • 123
  • 191
  • I just tried the WaitforsingleObject and create an event and abandoned the polling mechanism. It so far works fine on the dual core machine. Thank you for the useful hints – Shangping Guo May 04 '10 at 20:00
0

I guess that you already know that polling a global flag is a "Bad Idea™" so I'll skip that little speech. Try adding volatile to the flag declaration. That should force each read of it to read from memory. Without volatile, the implementation could be reading the flag into a register and not fetching it from memory.

D.Shawley
  • 58,213
  • 10
  • 98
  • 113
  • Thank you for the comments. I have defined the globals to be 'volatile' and it is no use – Shangping Guo May 04 '10 at 00:31
  • Actually I've seen different opinions on using volatile keywords. Someone says it is no use and will make the program much slower. So far from my practice I did not see any changes. – Shangping Guo May 04 '10 at 00:34
  • Why don't you just use an atomic variable? – WhirlWind May 04 '10 at 00:35
  • Hello WhirlWind, Thanks for the suggestion. I am not experienced in multithread programming yet, however, according to my opinion, there is no need to do the atomic operation on the global. There is no concurrent write to the global. Kindly correct me if I am wrong – Shangping Guo May 04 '10 at 00:45
  • 1
    Funny thing: at least on x86, atomic set and read operations compile to normal memory instructions. But the atomic types are coded so that the compiler won't optimize around their operations. – Karmastan May 04 '10 at 03:06
  • @karmastan: It could be that the memory writes are indeed already atomic for most types. Disabling the compiler optimizations is what I was after since the flag is most likely cached in a register somewhere and not being read from memory. – D.Shawley May 04 '10 at 12:11
  • anyone noticed the fact that the program runs good on uni-processor machine but deadlock on the dual-core machine? Any hints on this? – Shangping Guo May 04 '10 at 16:58
  • @ShangpingGuo: my guess is that it is a problem with coherency in the memory cache see http://en.wikipedia.org/wiki/Cache_coherence for some of the details. I would recommend using `CreateEvent()`, `SetEvent()`, and `WaitForSingleObject()` to implement proper communication between your threads instead of using `SuspendThread()`. The MSDN document on `SuspendThread` does mention that it is **not to be used for thread synchronization**. – D.Shawley May 04 '10 at 20:05