In the previous four articles in my Unintuitive Multithreading series, I focused mostly on design of a multi-threaded application. This time, I plan to cover a few techniques that I have found helpful in troubleshooting multi-threaded applications.
To some extent, the normal techniques you have used for troubleshooting will work in a multi-threaded program. But, some of the problems in a multihreaded program will defy your normal tactics. A large fraction of these problems will turn out to be race conditions. By their nature, race conditions are hard to duplicate and are not easy to reason about. There are a set of models that I have found that sometimes shed light on these kinds of problems. These models are:
- Lock Step
- Leap Frog
- Simultaneous Execution
- Instruction Interruption
Each of these approaches helps you to visualize ways in which threads may interact to cause you problems.
Lock Step
The Lock Step model is fairly simple. Pick two threads that might interact and mentally execute them one statement at a time, thread A, then thread B, then A, then B… Some of the most classic race conditions are instantly identifiable with this technique. For example, in trying to avoid the cost of a mutex, some programmers new to multithreading will try something like this.
static bool locked = false;
while(locked) ; // wait on the lock
locked = true;
do_critical();
locked = false;
The lock step scenario shows the fallacy here immediately.
- Thread A executes the
while
. - Thread B executes the
while
. - Thread A sets
locked
totrue
. - Thread B sets
locked
totrue
. - Thread A executes the
do_critical()
. - Thread B executes the
do_critical()
. - Thread A sets
locked
tofalse
. - Thread B sets
locked
tofalse
.
Without an understanding of multi-threading, a junior programmer might convince himself that the code would work. And it will. Most of the time. Since it has to work all of the time, we have a problem.
Let’s take this example a little farther. Say our junior programmer decides to get a bit more clever.
static int locked = 0;
while(locked++) // wait on the lock
--locked;
do_critical();
locked = 0;
non_critical()
This won’t fall to the lock step model from before, but it will cause problems if we lockstep with an offset.
- Thread A executes the
do_critical()
. - Thread B executes the
while(locked++)
. - Thread A sets
locked
to0
. - Thread B decrements
locked
to-1
. - Thread A executes the
non_critical()
. - Thread B executes the
while(locked++)
and continues to be locked out.
Walking through the code in lock step with a couple of threads can point out subtle timing issues in your MT code. The threads don’t have to be executing the same code for this approach to work. There can be an offset as in the second example, or the threads could be executing completely separate pieces of code.
It should be obvious that if you don’t do the lock step exactly correct, you may miss a race condition. This is why most race conditions succeed most of the time. The key to using this technique is to walk through the suspected code with an awareness for ways that things might go wrong. If something looks suspicious, change the offset between the threads and walk through again. You’ll often find that you become more and more suspicious of a piece of code as you get closer to the right timing.
Leap Frog
The Leap Frog model is similar to lock step except that you execute more than one instruction at a time, with first one thread leading then the other leading. This approach is often useful in cases where using lock step had made you suspicious of some code but had not actually triggered a race condition.
if(ptr)
{
void *tmp = ptr;
ptr = 0;
free( tmp );
}
- Thread A tests ptr.
- Thread B tests the ptr.
- Thread B saves
ptr
intmp
. - Thread B resets
ptr
to null. - Thread A saves
ptr
intmp
. - Thread A resets
ptr
to null. - Thread A executes the
free()
on a null pointer. - Thread B executes the
free()
on the real pointer.
Running this in lockstep would have spotted a potential double free. Looking at this code in leap frog mode shows a different problem. It is possible to free()
a null pointer which is undefined behavior in C and C++. Depending on which symptom you are trying to match, different scenarios help you to see different things.
Simultaneous Execution
On a single CPU machine, you are guaranteed that the CPU can only be executing 1 thread at a time. Even though it appears that two or more threads are executing simultaneously, the reality is that we are executing a little of one, then a little of the next, and so on. The machine loops through the list of runnable threads often enough that they all appear to be running at once to us.
On a multiple CPU machine, this is no longer true. You can have, at most, one thread per CPU executing at the same time. In other words, on a dual CPU machine, two threads can execute the following expression at exactly the same time.
unlocked++;
This can lead to a situation even more pathological than the lock step case we explored above. This is why some “multi-threaded” code works fine until the day it is run on a multiple CPU machine for the first time. There may have been race conditions that did not manifest because no threads were actually executing at exactly the same time. Once that situation occurred, these new race conditions manifested themselves.
Instruction Interruption
Many new MT programmers make the implicit assumption that each C expression is atomic. Actually, even on a single processor machine, we can be interrupted between any two assembler instructions. Here’s an example to show what I mean.
static int initialized = 0;
struct singleton onlyOnce;
if(!initialized++)
{
initialize( &onlyOnce );
}
The idea is that we only want to initialize the onlyOnce
structure the first time through this code. The problem is that the if
is much more complicated than it looks. In a kind of pseudo-assembler, the if
would look something like this:
move reg1, initialized
add reg2, 1, reg1
test reg1, 0
jumpne label
This piece of code could be interrupted before it starts, after it ends, or at any of the three points between instructions. Almost every C or C++ statement transforms into a set of instructions like this. Even on a single CPU machine, that means that you can be interrupted in the middle of almost any expression. Unlike many junior programmers seem to believe, ++i
is no more atomic an operation than i = i + 1
. (At least, you are not guaranteed that it will be any different.)
Summary
To summarize, many of the hardest problems to troubleshoot in a multithreaded application are race conditions. Race conditions are particularly hard to find, because they involve the interaction between two threads. This article lists four different ways to approach these interactions to help you see potential race conditions.