I've read a few of the suggested topics but unfortunately haven't found an answer to my question. Any suggestions are highly appreciated.

So, I'm working on a huge project, and here is the affected code snippet:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
#include <pthread.h>

static pthread_mutex_t mutex;
static pthread_cond_t cond;

static char *timestamp()
{
    time_t rawtime;
    struct tm *timeinfo;
    char *stamp;

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    stamp = asctime(timeinfo);
    stamp[strlen(stamp) - 1] = '\0';

    return stamp;
}

static void *monitor(void *ctx)
{
    int ret;
    struct timeval now;
    struct timespec timeout = { 0 };

    while (1)
    {
        gettimeofday(&now, NULL);

        timeout.tv_sec = now.tv_sec + 1;

        pthread_mutex_lock(&mutex);

        if (!(ret = pthread_cond_timedwait(&cond, &mutex, &timeout)))
        {
            printf("%s [%s] Signaled successfully!\n", timestamp(),
                __FUNCTION__);

        }
        else if (ret == ETIMEDOUT)
            printf("%s [%s] Timed out!\n", timestamp(), __FUNCTION__);

        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main()
{
    pthread_t tid;

    if (pthread_mutex_init(&mutex, NULL) < 0)
    {
        perror("mutex");
        return -1;
    }

    if (pthread_cond_init(&cond, NULL) < 0)
    {
        perror("cond");
        pthread_mutex_destroy(&mutex);
        return -1;
    }

    if (pthread_create(&tid, NULL, monitor, NULL) < 0)
    {
        perror("create()");
        pthread_cond_destroy(&cond);
        pthread_mutex_destroy(&mutex);
        return -1;
    }

    pthread_detach(tid);
    srand(time(NULL));

    while (1)
    {
        int delay = rand() % 5 + 1;

        printf("%s [main] Waker is sleeping for %d sec\n", timestamp(), delay);
        sleep(delay);

        printf("%s [main] Signaling...\n", timestamp());
        pthread_mutex_lock(&mutex);
        printf("%s [main] Lock acquired...\n", timestamp());
        pthread_cond_signal(&cond);
        printf("%s [main] Signaled. Releasing...\n", timestamp());
        pthread_mutex_unlock(&mutex);
    }

    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&mutex);

    return 0;
}

Every time the [main] thread reaches pthread_cond_signal while pthread_cond_timedwait is waiting (not timed out), the program hangs.

Locking the mutex before calling pthread_cond_signal is best practice, as I have read here.

This topic says that such a hang can happen if the cond/mutex is destroyed before waiting.

This topic describes spurious wakeups, which could lead to such a hang.

However, neither seems relevant to my case. I also assumed that this behavior could be related to ulimit -s/-i, but setting them to unlimited doesn't help. What's interesting is that the [monitor] thread also gets stuck, just as [main] does.

UPD

    Program output:
    Wed Jun  8 13:34:10 2022 [main] Waker is sleeping for 4 sec
    Wed Jun  8 13:34:10 2022 [monitor] Timed out!
    Wed Jun  8 13:34:11 2022 [monitor] Timed out!
    Wed Jun  8 13:34:12 2022 [monitor] Timed out!
    Wed Jun  8 13:34:13 2022 [monitor] Timed out!
    Wed Jun  8 13:34:14 2022 [main] Signaling...
    Wed Jun  8 13:34:14 2022 [main] Lock acquired...
    /* No further output at all */

UPD2:

I refactored the above example to use plain pthread_cond_wait, like this:

[monitor thread]
pthread_mutex_lock(&mutex);
pthread_cond_wait(&cond, &mutex);
pthread_mutex_unlock(&mutex);

[main thread]
pthread_mutex_lock(&mutex);
pthread_cond_signal(&cond);
pthread_mutex_unlock(&mutex);

and it hangs on pthread_cond_signal again... So it seems to be a problem on the OS side. I only know that a small ulimit stack size could lead to a stack overflow, which could cause such a hang (architecture-dependent, 100% reproducible in my case). Does anyone know of any other specific configuration that could affect it?

Volodymyr
  • What does the program print? – user253751 Jun 08 '22 at 10:31
  • @user253751, output attached – Volodymyr Jun 08 '22 at 10:38
  • 1
    I'm not sure if it's your issue, but `asctime` isn't thread-safe. It returns a pointer to a static buffer that both threads are reading and writing concurrently. You should use `asctime_r` or `strftime` instead. – Nate Eldredge Jun 08 '22 at 12:52
  • @NateEldredge, timestamp just added as sample and for debugging purposes. Anyway, thanks for the notice :) – Volodymyr Jun 08 '22 at 12:55
  • 2
    There's another race bug: your monitor thread unlocks and locks the mutex on each loop iteration, apart from its wait. In between those two, the main thread could lock the mutex and signal the condition. Since the monitor is not waiting on the condition at that time, it would miss the signal, and subsequent waits would never receive it. – Nate Eldredge Jun 08 '22 at 12:55
  • @Volodymyr: Yes I know, but if your debugging code is buggy, it makes it hard to know where the misbehavior is coming from. – Nate Eldredge Jun 08 '22 at 12:56
  • 2
    Now the unlock/lock race doesn't explain your hang, since even if the monitor misses a signal, the main thread will send another one in a few seconds. But it is a design problem. Normally the monitor's main loop should *not* have an explicit unlock/lock: at all times it should either be waiting or holding the mutex. If it does unlock the mutex, then before it waits again, it needs to determine (by inspecting some other program state) whether the desired event has already occurred. – Nate Eldredge Jun 08 '22 at 13:01
  • 3
    "The pthread_cond_broadcast() and pthread_cond_signal() functions shall have no effect if there are no threads currently blocked on cond.", according to the man page: https://linux.die.net/man/3/pthread_cond_signal – Volodymyr Jun 08 '22 at 13:02
  • 2
    Right. Like I said, it does not explain why `pthread_cond_signal` should appear to hang. In this program the only consequence would be that the monitor would miss the signal and wait for longer than expected, but in a real program (without a timeout) it could deadlock. I can't reproduce the hang you describe, so I am only able to guess. – Nate Eldredge Jun 08 '22 at 13:10
  • 1
    You are failing to check the return values for many of your mutex and CV manipulations. This is unwise in general, and it seems an especially natural direction for investigating misbehavior such as you observe. – John Bollinger Aug 16 '22 at 13:44
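
The suggestions in the comments above (a thread-safe timestamp, a predicate checked under the mutex, and checked return values) could be combined roughly as in the sketch below. It is only an illustration under stated assumptions, not a confirmed fix for the hang, which nobody in the thread was able to reproduce. The names `pending`, `done`, `handled`, `check()` and `run_demo()` are hypothetical and do not appear in the original program; `clock_gettime` is swapped in for `gettimeofday` so that `tv_nsec` is filled in as well.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* All names below except mutex, cond, timestamp and monitor are hypothetical. */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int pending;  /* events posted but not yet consumed (guarded by mutex) */
static int done;     /* tells the monitor to exit (guarded by mutex) */
static int handled;  /* events consumed by the monitor (guarded by mutex) */

/* pthread functions return an error number; they do not set errno. */
static int check(int rc, const char *what)
{
    if (rc != 0 && rc != ETIMEDOUT)
        fprintf(stderr, "%s: %s\n", what, strerror(rc));
    return rc;
}

/* The caller supplies the buffer, so no static storage is shared. */
static const char *timestamp(char *buf, size_t len)
{
    time_t now = time(NULL);
    struct tm tm;

    localtime_r(&now, &tm);
    if (strftime(buf, len, "%a %b %e %H:%M:%S %Y", &tm) == 0)
        buf[0] = '\0';
    return buf;
}

static void *monitor(void *ctx)
{
    char ts[32];

    (void)ctx;
    check(pthread_mutex_lock(&mutex), "pthread_mutex_lock");
    for (;;) {
        struct timespec timeout;
        int rc = 0;

        clock_gettime(CLOCK_REALTIME, &timeout);
        timeout.tv_sec += 1;

        /* Re-check the predicate after every wakeup, spurious or not. */
        while (pending == 0 && !done && rc == 0)
            rc = check(pthread_cond_timedwait(&cond, &mutex, &timeout),
                       "pthread_cond_timedwait");

        if (pending > 0) {
            handled += pending;
            pending = 0;
            printf("%s [%s] Signaled\n", timestamp(ts, sizeof ts), __func__);
        } else if (done || rc != ETIMEDOUT) {
            break;  /* shutdown requested, or an unexpected error */
        } else {
            printf("%s [%s] Timed out\n", timestamp(ts, sizeof ts), __func__);
        }
    }
    check(pthread_mutex_unlock(&mutex), "pthread_mutex_unlock");
    return NULL;
}

/* Posts `signals` events and returns how many the monitor consumed. */
static int run_demo(int signals)
{
    pthread_t tid;

    pending = done = handled = 0;
    if (check(pthread_create(&tid, NULL, monitor, NULL), "pthread_create") != 0)
        return -1;

    for (int i = 0; i <= signals; i++) {
        check(pthread_mutex_lock(&mutex), "pthread_mutex_lock");
        if (i < signals)
            pending++;   /* record the event before signaling */
        else
            done = 1;    /* last pass: ask the monitor to stop */
        check(pthread_cond_signal(&cond), "pthread_cond_signal");
        check(pthread_mutex_unlock(&mutex), "pthread_mutex_unlock");
    }

    check(pthread_join(tid, NULL), "pthread_join");
    return handled;  /* safe to read: the monitor has exited */
}
```

Calling `run_demo(3)` from a `main` function (built with `-pthread`) should report that all three events were consumed, even when a signal arrives while the monitor is not blocked in the wait, because the event is recorded in `pending` under the mutex rather than carried only by the signal itself.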