
I am trying to understand lock-free programming, and I wrote a lock-free stack:

template <typename T>
class LockFreeStack
{
    struct Node
    {
        std::shared_ptr<T> data;
        Node* next;

        explicit Node(const T& _data)
            : data(std::make_shared<T>(_data)), next(nullptr)
        {}
    };
    std::atomic<Node*> head;

public:
    void push(const T& data)
    {
        auto n{new Node(data)};
        n->next = head.load();
        while (!head.compare_exchange_weak(n->next, n))
            ;
    }

    std::shared_ptr<T> pop(void)
    {
        auto old_head{head.load()};
        while (old_head && head.compare_exchange_weak(old_head, old_head->next))
            ;
        return old_head ? old_head->data : std::shared_ptr<T>{};
    }
};

And two threads perform push/pop operations on:

static LockFreeStack<int> global_stack;

And the main function:

int main(void)
{
    std::srand(std::time(nullptr));

    std::thread pushing_thread([](void) {
        for (size_t i{}; i < MAX_LENGTH; ++i)
        {
            const auto v{std::rand() % 10000};
            global_stack.push(v);

            std::cout << "\e[41mPushing: " << v << "\e[m" << std::endl;
        }
    });
    std::thread poping_thread([](void) {
        for (size_t i{}; i < MAX_LENGTH; ++i)
        {
            if (auto v{global_stack.pop()}; v)
            {
                std::cout << "\e[42mPopping: " << *v << "\e[m" << std::endl;
            }
        }
    });

    pushing_thread.join();
    poping_thread.join();
}

This program only runs pushing_thread in Debug mode, but when I run the program under a debugger, both threads run as expected. The same happens if I wait a moment between starting the threads:

std::thread pushing_thread(...);
std::this_thread::sleep_for(1s);
std::thread poping_thread(...);

It works properly. So what happens when we run the program under a debugger?

  • Compiler: GCC 9.3.
  • Compiler flags: -std=c++2a -lpthread -Wall.
  • OS: ArchLinux with linux-5.5.13.
Ghasem Ramezani
    Please explain what exactly doesn't work. – walnut Apr 01 '20 at 17:53
  • @walnut, The `poping_thread` doesn't work in Debug mode. – Ghasem Ramezani Apr 01 '20 at 17:56
  • Does it not pop? Does it corrupt your data? Does it hang? Does it crash? – JohnFilleau Apr 01 '20 at 17:57
  • @GhasemRamezani So, do you get a segmentation fault? Does it give wrong output? If so what is the output? – walnut Apr 01 '20 at 17:58
  • @GhasemRamezani https://stackoverflow.com/help/how-to-ask – searchengine27 Apr 01 '20 at 17:58
  • @JohnFilleau, It finishes without any problem, but there is no output from `poping_thread`. – Ghasem Ramezani Apr 01 '20 at 18:01
  • @walnut, No, there is no segmentation fault; the program exits without any problem. Only `pushing_thread` produces output on the terminal (in Debug mode). – Ghasem Ramezani Apr 01 '20 at 18:03
  • Note that while this is fun for educational purposes, this kind of design maximizes contention and is unlikely to perform as well as a mutex, particularly on a system with high load. See [this](https://stackoverflow.com/a/7064008/721269) answer. – David Schwartz Apr 01 '20 at 18:07
  • @DavidSchwartz I tend to disagree. It depends on the data structure and how many (expensive) atomic operations a lock-free implementation requires. But a simple lock-free stack will almost always outperform a simple mutex based one. You are right that you can have high contention on the head, but this can be reduced by using an elimination backoff. – mpoeter Apr 03 '20 at 11:19
  • @mpoeter My experience is the reverse. A simple mutex-based stack almost always outperforms a lock-free stack because a lock-free stack allows contending threads to just keep contending with each other and run slowly indefinitely while a mutex finds the threads that do not contend and makes them run concurrently. Descheduling contending threads so non-contending threads run concurrently is the key to performance on real systems with real load. – David Schwartz Apr 03 '20 at 14:50
  • @DavidSchwartz A mutex is used to serialize execution, so threads can never run concurrently. Every good mutex implementation will spin at least for a while before going to sleep. So essentially you move contention on head to contention on the mutex. If a thread cannot acquire the mutex within some time, it will go to sleep. This reduces contention, but has additional overhead due to the syscall. Yes, contention on head can be a problem, but this can be mitigated using different backoff strategies. We can argue endlessly, in the end we would have to run some benchmarks to compare numbers. ;) – mpoeter Apr 03 '20 at 15:08
  • @mpoeter Two threads that want to use the same mutex-protected collection contend *once*. The mutex puts one of those threads to sleep and the OS schedules some other thread. Now only one thread that wants to access the collection is running, so it runs at full speed with no system calls for the mutex. A lock-free collection can't do that. A lock-free collection will happily allow two contending threads to run concurrently, fighting for the same resources over and over and over and over. – David Schwartz Apr 03 '20 at 15:13
  • @DavidSchwartz Two threads that want to operate on the same lock-free stack also only contend once, namely when one thread changes head in such a way that the second thread has to perform a retry. – mpoeter Apr 03 '20 at 15:17
  • @mpoeter No. They do that over and over again each time one of them touches the stack, potentially millions of times in the same timeslice. A lock-free collection isn't smart enough to find the set of ready-to-run threads that don't contend. A mutex is. – David Schwartz Apr 03 '20 at 15:22
  • @DavidSchwartz I assume that each thread performs the following step: 1) acquire the mutex; 2) perform a _single_ operation; 3) release the mutex. So when each thread has to perform millions of operations, the mutex is potentially contented _every single time_, just like in the lock-free case. – mpoeter Apr 03 '20 at 17:04
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/210889/discussion-between-david-schwartz-and-mpoeter). – David Schwartz Apr 03 '20 at 17:10

2 Answers


This logic is flawed:

    while (old_head && head.compare_exchange_weak(old_head, old_head->next))
        ;

If old_head is not null and the exchange succeeded, the loop tries again! The condition needs a `!` in front of the compare-exchange so that the loop retries only while the CAS fails.

Martin York

Your implementation suffers from the so-called ABA problem. Consider the following scenario:

  • the stack looks like this: A->B->C (i.e., head points to A)
  • thread 1 loads the head (i.e., the address of A) and A's next pointer (B), but before it can perform the CAS, it is interrupted.
  • thread 2 pops A, then B, and then pushes A, i.e., the stack now looks like this: A->C
  • thread 1 resumes and happily updates the head from A to B -> your stack is corrupted - it looks like this: B->C - but B is currently in use by thread 2.

There are several possible solutions to avoid the ABA problem, like tagged pointers or concurrent memory reclamation schemes.

Update:
A tagged pointer is simply a pointer that is extended with a version counter, where the version tag is incremented every time the pointer is updated. You can either use DWCAS (Double-Width CAS) to update a struct with a separate version field, or you can squeeze the version tag into the upper bits of the pointer. Not all architectures provide DWCAS instructions (x86 does), and it depends on the OS whether the upper bits are unused (on Windows and Linux it is usually possible to use the 16 topmost bits).

On the topic of memory reclamation schemes I can refer you to my thesis: Effective memory reclamation for lock-free data structures in C++

mpoeter
  • 2,574
  • 1
  • 5
  • 12
  • @GhasemRamezani I have updated my answer to provide more details on the possible solutions for the ABA problem. – mpoeter Apr 03 '20 at 11:26