What could cause a mutex to misbehave?

Question

I've spent the last couple of months debugging a rare crash that originates somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch out-of-bounds memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own functions, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes. It found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1 KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past, so I don't doubt its functionality.
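
For readers unfamiliar with the padding technique, here is a minimal sketch of the idea. It is not the checker used in this project: the names `guarded_malloc`, `guarded_free`, `guards_intact`, `PAD`, `HDR` and `PATTERN` are all made up for illustration, and the hooking of the real malloc/free is left out. Each block gets a known byte pattern before and after the user data, and the pattern is verified when the block is freed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical names throughout; an illustration of the padding idea only.
static const std::size_t HDR = 16;          // room for the size field, keeps 16-byte alignment
static const std::size_t PAD = 64;          // guard bytes on each side (the question used up to 1 KB)
static const unsigned char PATTERN = 0xA5;  // known fill value

// Layout of one block: [HDR: user size][PAD guard][user data][PAD guard]
void* guarded_malloc(std::size_t size) {
    unsigned char* raw =
        static_cast<unsigned char*>(std::malloc(HDR + PAD + size + PAD));
    if (!raw) return NULL;
    std::memcpy(raw, &size, sizeof(size));              // remember the user size
    std::memset(raw + HDR, PATTERN, PAD);               // front guard
    std::memset(raw + HDR + PAD + size, PATTERN, PAD);  // rear guard
    return raw + HDR + PAD;
}

// Returns true if both guard zones still hold the pattern.
bool guards_intact(const void* user) {
    const unsigned char* p = static_cast<const unsigned char*>(user);
    const unsigned char* raw = p - PAD - HDR;
    std::size_t size;
    std::memcpy(&size, raw, sizeof(size));
    for (std::size_t i = 0; i < PAD; i++) {
        if (raw[HDR + i] != PATTERN) return false;  // write before the block
        if (p[size + i] != PATTERN) return false;   // write past the block
    }
    return true;
}

void guarded_free(void* user) {
    if (!user) return;
    assert(guards_intact(user));  // an out-of-bounds write trips this
    std::free(static_cast<unsigned char*>(user) - PAD - HDR);
}
```

A sketch like this only catches writes that land inside a guard zone and only notices them when the guards are checked, which is why a clean result does not rule out other kinds of heap damage.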

(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)

Now, my memory bounds checker also has a feature where it prepends a data struct to every allocated block. These structs are all linked in one long linked list, which lets me occasionally walk over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When I investigated, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:

pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.

void malloc_wrapper() {
  // ...
  pthread_mutex_lock(&alloc_mutex);
  if (boolmutex) {
    printf("mutex misbehaving\n");
    __THROW_ERROR__; // this happens!
  }
  boolmutex = true;
  // manipulate linked list here
  boolmutex = false;
  pthread_mutex_unlock(&alloc_mutex);
  // ...
}

The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
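
To make the guard-array experiment concrete, this is roughly the kind of arrangement I mean. It is a sketch with made-up names (`GuardedMutex`, `guarded_mutex_init`, `guarded_mutex_intact`, `GUARD_BYTE`) and an arbitrary guard size of 256 bytes, not the exact code from the project:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <pthread.h>

// Hypothetical names and sizes; illustrates the guard-array idea only.
static const unsigned char GUARD_BYTE = 0x5A;

struct GuardedMutex {
    unsigned char before[256];  // guard zone in front of the mutex
    pthread_mutex_t mutex;
    unsigned char after[256];   // guard zone behind the mutex
};

void guarded_mutex_init(GuardedMutex* g) {
    std::memset(g->before, GUARD_BYTE, sizeof(g->before));
    std::memset(g->after, GUARD_BYTE, sizeof(g->after));
    int ret = pthread_mutex_init(&g->mutex, NULL);
    assert(ret == 0);
    (void)ret;  // silence unused-variable warnings when NDEBUG is defined
}

// Returns true if neither guard zone has been written to since init.
bool guarded_mutex_intact(const GuardedMutex* g) {
    for (std::size_t i = 0; i < sizeof(g->before); i++)
        if (g->before[i] != GUARD_BYTE) return false;
    for (std::size_t i = 0; i < sizeof(g->after); i++)
        if (g->after[i] != GUARD_BYTE) return false;
    return true;
}
```

Calling something like `guarded_mutex_intact()` at the point where the "impossible" branch is hit shows whether a stray write ran across the neighbouring memory. It cannot detect a write that lands only on the mutex itself.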

So, what kind of corruption could possibly cause this to happen, and how would I find and fix the cause?

A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads seems to make the corruptions less common, but they do not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after anywhere from 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out-of-bounds writes, but glibc still occasionally fails with a corrupted heap error (can such an error be caused by something other than an out-of-bounds write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for out-of-bounds heap accesses, which I won't list here to keep this post from getting any longer.

Thanks in advance for any help!

Seems extremely unlikely we could help you here. You seem to have done most of what you should in this situation, except for narrowing it down to a smaller program. And, yes, I know that's difficult and time-consuming in cases like this. But, no, that doesn't magically make it any less crucial. Good luck! — Lightness Races in Orbit, Jul 05 '15 at 15:45
Might want to try Valgrind with this code and see if it spots anything related — Glenn Teitelbaum, Jul 05 '15 at 15:49
Can you use gcc 4.8 & try AddressSanitizer (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer) on your platform? Read https://en.wikipedia.org/wiki/AddressSanitizer. — , Jul 05 '15 at 15:56
Are you getting an exception in a thread that has locked the *mutex* that is, therefore not unlocking again? It may be worth making a *smart lock* type object (like a [std::lock_guard](http://en.cppreference.com/w/cpp/thread/lock_guard)) that releases the *mutex* when it goes out of scope. — Galik, Jul 05 '15 at 15:57
A few suggestions: Try to reduce optimizations or switch to a different compiler. Try to use an off-the-shelf memory debugger like debauch or dmalloc. Does your memory debugger catch use-after-free errors? Switch to a different platform. That said, you will need to find a *minimal* example, I'm afraid. — Ulrich Eckhardt, Jul 05 '15 at 16:12
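
For reference, the "smart lock" suggested in the comments can be sketched for a raw `pthread_mutex_t` as below. `std::lock_guard` itself works with `std::mutex`, so this hand-rolled equivalent is an assumption about how one might apply the idea here; `ScopedLock`, `list_mutex`, `list_length` and `add_or_throw` are hypothetical names, and the counter merely stands in for the linked-list manipulation.

```cpp
#include <cassert>
#include <pthread.h>
#include <stdexcept>

// Hypothetical RAII wrapper: locks in the constructor, unlocks in the
// destructor, so the mutex is released even if an exception propagates.
class ScopedLock {
public:
    explicit ScopedLock(pthread_mutex_t& m) : m_(m) { pthread_mutex_lock(&m_); }
    ~ScopedLock() { pthread_mutex_unlock(&m_); }
private:
    ScopedLock(const ScopedLock&);             // non-copyable (C++98 style)
    ScopedLock& operator=(const ScopedLock&);
    pthread_mutex_t& m_;
};

static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;
static int list_length = 0;  // stands in for the allocation list

void add_or_throw(bool fail) {
    ScopedLock lock(list_mutex);
    if (fail)
        throw std::runtime_error("simulated failure while holding the lock");
    ++list_length;  // the protected manipulation would go here
}
```

With this in place, a call such as `add_or_throw(true)` leaves `list_mutex` unlocked after the exception is caught, which rules out the "locked and never released" scenario.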

score 0 · Accepted Answer · answered Oct 15 '15 at 09:20

Thanks to all commenters. I tried nearly all of the suggestions with no results, until I finally decided to write a simple memory allocation stress test: one that runs a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or a few hours at most.

Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.

For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:

<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap

I hope this information helps someone. Cheers :)

// Multithreaded heap stress test. By Itay Chamiel 20151012.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>

#define NUM_THREADS 4 // set to number of CPU cores

#define ALIVE_INDICATOR NUM_THREADS

// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
    int* alive_flag = (int*)arg;
    int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
    int cnt = 0;
    timeval t_pre, t_post;
    gettimeofday(&t_pre, NULL);

    const int ALLOCATE=1, FREE=0;
    const unsigned int MINSIZE=500, MAXSIZE=1000;
    const int MAX_ALLOC=10000;
    char* membufs[MAXSIZE];
    unsigned int membufs_size = 0;

    int num_allocs = 0, num_frees = 0;

    while(1)
    {
        int action;
        // Decide whether to allocate or free a memory block.
        // if we have less than MINSIZE buffers, allocate.
        if (membufs_size < MINSIZE) action = ALLOCATE;
        // if we have MAXSIZE, free.
        else if (membufs_size >= MAXSIZE) action = FREE;
        // else, decide randomly.
        else {
            action = ((rand() & 0x1)? ALLOCATE : FREE);
        }

        if (action == ALLOCATE) {
            // choose size to allocate, from 1 to MAX_ALLOC bytes
            size_t size = (rand() % MAX_ALLOC) + 1;
            // allocate and fill memory
            char* buf = (char*)malloc(size);
            assert(buf != NULL); // fail loudly rather than memset a null pointer
            memset(buf, 0x77, size);
            // add buffer to list
            membufs[membufs_size] = buf;
            membufs_size++;
            assert(membufs_size <= MAXSIZE);
            num_allocs++;
        }
        else { // action == FREE
            // choose a random buffer to free
            size_t pos = rand() % membufs_size;
            assert (pos < membufs_size);
            // free and remove from list by replacing entry with last member
            free(membufs[pos]);
            membufs[pos] = membufs[membufs_size-1];
            assert(membufs_size > 0); // counter is unsigned, so check before decrementing
            membufs_size--;
            num_frees++;
        }

        // once in 10 seconds print a status update
        gettimeofday(&t_post, NULL);
        if (t_post.tv_sec - t_pre.tv_sec >= 10) {
            printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
            gettimeofday(&t_pre, NULL);
        }

        // indicate alive to main thread
        *alive_flag = ALIVE_INDICATOR;
    }
    return NULL;
}

int main()
{
    int alive_flag[NUM_THREADS];
    printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
    // start a thread for each core
    for (int i=0; i<NUM_THREADS; i++) {
        alive_flag[i] = i; // tell each thread its ID.
        pthread_t th;
        int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
        assert(ret == 0);
    }

    while(1) {
        sleep(10);
        // check that all threads are alive
        bool ok = true;
        for (int i=0; i<NUM_THREADS; i++) {
            if (alive_flag[i] != ALIVE_INDICATOR)
            {
                printf("Thread %d is not responding\n", i);
                ok = false;
            }
        }
        assert(ok);
        for (int i=0; i<NUM_THREADS; i++)
            alive_flag[i] = 0;
    }
    return 0;
}
