Ways to Find a Race Condition

Question

I have a bit of code with a race condition in it... I know that it is a race condition because it does not happen consistently, and it seems to happen more often on dual core machines.

It never happens when I'm tracing, although it is possible that it is a deadlock as well. By comparing how far the logs get in runs where this does and does not occur, I've been able to pinpoint the bug to a single function. However, I do not know where within that function it happens. It's not at the top level.

Adding log statements or breakpoints is going to change the timing if it is a race condition, and prevent this from happening.

Is there any technique that I can use aside from getting a race condition analyzer that will allow me to pinpoint where this is happening?

This is in Visual Studio 9, with C++ (of the unmanaged variety).

Last time I had a serious race condition, I knew locally where it was occurring. I did it the "old fashioned way" and resorted to graphing call trees and highlighting lock durations for each call by hand. In my case it was relegated to 2 source files and a handful of functions, but it proved invaluable. — Nathan Ernst, Jun 29 '10 at 01:13

score 11 · Answer 1 · answered Jan 15 '16 at 10:01

There is a tool called ThreadSanitizer, included in Clang and in GCC 4.8+.

You compile your code using the -fsanitize=thread flag.

Example:

$ cat simple_race.cc
#include <pthread.h>
#include <stdio.h>

int Global;

void *Thread1(void *x) {
  Global++;
  return NULL;
}

void *Thread2(void *x) {
  Global--;
  return NULL;
}

int main() {
  pthread_t t[2];
  pthread_create(&t[0], NULL, Thread1, NULL);
  pthread_create(&t[1], NULL, Thread2, NULL);
  pthread_join(t[0], NULL);
  pthread_join(t[1], NULL);
}

And the output:

$ clang++ simple_race.cc -fsanitize=thread -fPIE -pie -g
$ ./a.out 
==================
WARNING: ThreadSanitizer: data race (pid=26327)
  Write of size 4 at 0x7f89554701d0 by thread T1:
    #0 Thread1(void*) simple_race.cc:8 (exe+0x000000006e66)

  Previous write of size 4 at 0x7f89554701d0 by thread T2:
    #0 Thread2(void*) simple_race.cc:13 (exe+0x000000006ed6)

  Thread T1 (tid=26328, running) created at:
    #0 pthread_create tsan_interceptors.cc:683 (exe+0x00000001108b)
    #1 main simple_race.cc:19 (exe+0x000000006f39)

  Thread T2 (tid=26329, running) created at:
    #0 pthread_create tsan_interceptors.cc:683 (exe+0x00000001108b)
    #1 main simple_race.cc:20 (exe+0x000000006f63)
==================
ThreadSanitizer: reported 1 warnings

Thread Sanitizer does **not** detect race conditions. It detects data races, which are not necessarily the same thing. Your example shows what the standard calls a data race, not a race condition. See [this](https://stackoverflow.com/questions/49450136/thread-sanitizer-gives-false-negative-for-function-race) for a case where it fails to detect a race condition. – NathanOliver, Mar 23 '18 at 13:11
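
To make the distinction in the comment above concrete, here is a minimal sketch (not from the thread) of a race condition that contains no data race. It assumes a C++11 compiler, which is newer than the Visual Studio 9 in the question, and the names `withdraw_racy`, `withdraw_checked` and `run_withdrawals` are hypothetical. Every access in `withdraw_racy` is atomic, so a data-race detector has nothing to flag, yet another thread can still withdraw between the check and the subtraction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Racy check-then-act: the load and the subtraction are each atomic, but
// another thread may withdraw in between, so the balance can go negative.
// There is no data race here, only a race condition.
bool withdraw_racy(std::atomic<int>& balance, int amount) {
    assert(amount > 0);
    if (balance.load() >= amount) {   // check
        balance.fetch_sub(amount);    // act (possibly on a stale check)
        return true;
    }
    return false;
}

// One possible fix: a compare-and-swap loop turns check and update into a
// single atomic step.
bool withdraw_checked(std::atomic<int>& balance, int amount) {
    assert(amount > 0);
    int current = balance.load();
    while (current >= amount) {
        // On failure, compare_exchange_weak reloads `current` for the retry.
        if (balance.compare_exchange_weak(current, current - amount)) {
            return true;
        }
    }
    return false;
}

// Runs `threads` concurrent checked withdrawals and returns the final balance.
int run_withdrawals(int start, int amount, int threads) {
    std::atomic<int> balance{start};
    std::vector<std::thread> pool;
    for (int i = 0; i < threads; ++i) {
        pool.emplace_back([&balance, amount] { withdraw_checked(balance, amount); });
    }
    for (auto& t : pool) t.join();
    return balance.load();
}
```

With a starting balance of 100 and eight threads each trying to withdraw 60, the checked version lets exactly one withdrawal succeed. The racy version can occasionally let two through, which is the kind of bug that needs the other techniques in this thread.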

score 6 · Accepted Answer · answered Jun 28 '10 at 19:10

Put sleeps in various parts of your code. Code that is thread-safe will remain thread-safe even if it (or asynchronous code) sleeps, even for seconds.

Mark Peters

80,126
17
159
190

I know it is quite late to comment on this answer, but is there any example to show how 'put sleeps' can be used to detect race conditions? – james Dec 06 '18 at 13:34
@james: This is a very "whitebox" approach to race condition analysis and was just meant to point out that if you have a race condition which is only "winning" 1% of the time, putting some sleeps in one of the competing threads can make it "win" ~100% of the time, making it easier to diagnose. Since it necessarily requires you to know a lot about (and modify) the code, it's hard to give examples here. – Mark Peters Dec 06 '18 at 15:46
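
As a rough illustration of the "put sleeps" idea discussed in these comments, the sketch below injects a delay between the check and the act of a check-then-act sequence. It is a made-up example rather than code from the thread: `count_initializations` is a hypothetical name, and it assumes a C++11 compiler (newer than Visual Studio 9). Atomics are used so that the only bug is the race condition itself.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Two threads each "initialize" a resource only if nobody has done so yet.
// Returns how many times the initialization actually ran (1 is correct).
int count_initializations(std::chrono::milliseconds injected_delay) {
    assert(injected_delay.count() >= 0);
    std::atomic<bool> initialized{false};
    std::atomic<int> init_calls{0};

    auto worker = [&] {
        if (!initialized.load()) {                        // check
            // Injected sleep: widens the window between check and act so the
            // losing interleaving happens almost every run instead of rarely.
            std::this_thread::sleep_for(injected_delay);
            ++init_calls;                                 // act
            initialized.store(true);
        }
    };

    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    return init_calls.load();
}
```

With no delay the function usually returns 1 and the bug stays hidden. With a delay of a couple of hundred milliseconds both threads almost always pass the check before either one acts, so the double initialization shows up reliably and can be debugged.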

score 2 · Answer 3 · answered Jun 28 '10 at 18:47

Indeed there are some attempts to find race conditions automatically.

Another term I read in conjunction with race condition detection is RaceFuzzer, but I was not able to find really useful information about it.

I think this is a relatively young field of investigation, so there are, as far as I know, mainly theoretical papers on this subject. However, try googling one of the above keywords; maybe you will find some useful information.

score 2 · Answer 4 · answered Jun 28 '10 at 18:47

The best way I know of to track these down is to use CHESS in Visual Studio. This is not a simple tool to use, and will probably require testing subsections of your app progressively. Good luck.

Dour High Arch


score 2 · Answer 5 · answered Jun 28 '10 at 19:05

I've had some luck using Visual Studio's tracepoints to find race conditions. Of course they still affect the timing, but in the cases where I used them, at least, it wasn't enough to completely prevent the race conditions from occurring. They seemed less disruptive than dedicated logging, at least.

Other than that, try posting the code so that others can look it over. Just studying the code in detail isn't a bad way to find race conditions.

Rokujolady · Answer 6 · 2010-06-28T23:19:55.533

So, the sledgehammer method for me has been the following, which takes a lot of patience and can, in the best-case scenario, get you on the right track. I used this to figure out what was going on with this particular problem. I have been using tracepoints, one at the beginning of the suspected high-level function and one at the end. If adding the tracepoint at the beginning of the function causes your bug to stop happening, move the tracepoint down until you can reproduce the condition again. The idea is that the tracepoint will not affect timing if you place it after the call that eventually triggers unsafe code, but will if you place it before. Also, watch your output window: between which messages is your bug occurring? You can use tracepoints to narrow this range as well.

Once you narrow your bug down to a manageable region of code, you can throw in breakpoints and have a look at what the other threads are up to at this point.
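
Where tracepoints or log statements still disturb the timing too much, a related technique (not the one this answer describes) is to record checkpoints into a preallocated in-memory buffer and inspect it only after the threads have finished. The sketch below is one way that might look, assuming a C++11 compiler; `TraceEvent`, `trace`, `trace_size`, `record_from_threads` and the capacity are all hypothetical choices.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct TraceEvent {
    int thread_index;  // which worker recorded the event
    int checkpoint;    // caller-chosen id for a place in the code
};

constexpr std::size_t kTraceCapacity = 4096;
std::array<TraceEvent, kTraceCapacity> g_trace;
std::atomic<std::size_t> g_trace_next{0};

// Claims a unique slot with a single atomic add. Events beyond the capacity
// are dropped, so no two threads ever write the same slot.
inline void trace(int thread_index, int checkpoint) {
    std::size_t slot = g_trace_next.fetch_add(1, std::memory_order_relaxed);
    if (slot < kTraceCapacity) {
        g_trace[slot] = TraceEvent{thread_index, checkpoint};
    }
}

// Number of events actually stored; call this after the workers have joined.
std::size_t trace_size() {
    std::size_t n = g_trace_next.load();
    return n < kTraceCapacity ? n : kTraceCapacity;
}

// Demo driver: each worker records `events_per_thread` checkpoints.
std::size_t record_from_threads(int threads, int events_per_thread) {
    assert(threads > 0 && events_per_thread >= 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([t, events_per_thread] {
            for (int i = 0; i < events_per_thread; ++i) trace(t, i);
        });
    }
    for (auto& th : pool) th.join();
    return trace_size();
}
```

The slot order gives the order in which threads claimed slots, which is usually enough to see which thread reached which checkpoint first. A single atomic add still perturbs timing a little, so this only reduces the problem rather than removing it.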

score 1 · Answer 7 · answered Jun 28 '10 at 19:33

It can also be a resource that is not protected, which would explain the inconsistent behaviour (especially if it works fine on a single core but not on dual core). In any case, code review (for both race conditions and non-thread-safe code) can be the shortest path to the solution.
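
For reference during such a review, this is roughly what a protected resource looks like, as a minimal sketch assuming C++11 (`GuardedCounter` and `count_with_threads` are hypothetical names). The thing to check is that every access to the shared state, reads included, goes through the same lock.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// A shared counter whose state is only ever touched while holding its mutex.
class GuardedCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);  // held for the whole read-modify-write
        ++value_;
    }
    int value() const {
        std::lock_guard<std::mutex> lock(mutex_);  // reads take the lock too
        return value_;
    }
private:
    mutable std::mutex mutex_;
    int value_ = 0;
};

// Increments one counter from several threads and returns the final value.
int count_with_threads(int threads, int increments_per_thread) {
    assert(threads > 0 && increments_per_thread >= 0);
    GuardedCounter counter;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) counter.increment();
        });
    }
    for (auto& th : pool) th.join();
    return counter.value();
}
```

If the lock in `increment` were removed, the final count would sometimes come up short on a multi-core machine and usually look fine on a single core, which matches the symptom described in the question.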

score 0 · Answer 8 · answered Feb 14 '13 at 07:30

You can use tools like Intel Inspector, which can check for certain types of race conditions.

fons

