
I sometimes get random failures when joining a pthread. All I can say is that the thread is not stuck in a deadlock on a mutex when the join fails. Most of the time the thread is idle (in a sleep syscall) when the join times out.

My need is basic: a way to start and stop a thread from the main thread, so I don't need a mutex around the pthread state variables in the start/stop manager. The thread runs an infinite loop most of the time. All my threads are built from the same skeleton: a start function, a stop function, and the thread function itself. A global variable g_event_ctx stores the current status of each thread: running tells me whether I need to cancel it, and is_joinable tells me whether I need to join it. Moreover, every thread function makes sleep/read/write syscalls, which are cancellation points.

typedef struct pthread_context
{
    pthread_t id;       /*!< thread handle, kept so the thread can be stopped later */
    int running;        /*!< nonzero while the thread is running */
    int is_joinable;    /*!< nonzero if the thread still has to be joined */
} str_pthread_context;

The start/stop skeleton:

    int start_x_manager (void)
    {
        pthread_t t_x;

        if (g_event_ctx.x_thread.is_joinable) return 0;

        PRINT_INFO ("Start x manager");

        // start push x thread
        if (pthread_create (&t_x, NULL, x_loop_thread, NULL))
            PRINT_ERR_GOTO ("error on pthread_create for x thread");
        pthread_setname_np(t_x, "x");
        g_event_ctx.x_thread.id = t_x;
        g_event_ctx.x_thread.is_joinable = 1;
        g_event_ctx.x_thread.running = 1;
        return 0;
    error:
        g_event_ctx.x_thread.running = 0;
        g_event_ctx.x_thread.is_joinable = 0;
        return 1;
    }

    int stop_x_manager (void)
    {
        struct timespec ts;

        if (!g_event_ctx.x_thread.is_joinable) return 0;
        PRINT_INFO ("Stop x manager");

        if (g_event_ctx.x_thread.running)
        {
            CHECK_ERR_GOTO (pthread_cancel(g_event_ctx.x_thread.id) != 0, "Cannot cancel x thread");
            g_event_ctx.x_thread.running = 0;
        }
        CHECK_ERR_GOTO (clock_gettime(CLOCK_REALTIME, &ts) == -1, "Cannot get clock time");
        ts.tv_sec += 5;
        CHECK_ERR_GOTO (pthread_timedjoin_np (g_event_ctx.x_thread.id, NULL, &ts) != 0, "Cannot join x_thread");
        g_event_ctx.x_thread.is_joinable = 0;
        return 0;
    error:
        g_event_ctx.x_thread.running = 0;
        g_event_ctx.x_thread.is_joinable = 0;
        return 1;
    }

The skeleton of the thread function:

void *x_loop_thread (void *arg __attribute__((__unused__)))
{

    CHECK_ERR_GOTO (pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL) != 0, "Cannot set cancel state");
    CHECK_ERR_GOTO (pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL) != 0, "Cannot set cancel type");

    PRINT_INFO ("Start x manager loop thread ...");

    pthread_cleanup_push(x_manager_cleanup, some_stuff);

    while (1)
    {
         // Do some stuff here
    }
    g_event_ctx.x_thread.running = 0;
    pthread_exit (NULL);
  error:
    g_event_ctx.x_thread.running = 0;
    pthread_cleanup_pop(1);
    pthread_exit (NULL);
}

CHECK_ERR_GOTO is a macro that checks a condition and jumps to the error label if the condition is true.
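
For readers following along, here is a minimal sketch of what such a macro could look like. The real definition is not shown in the question, so this version, its message format, and the check_demo helper are hypothetical illustrations only.

```c
#include <stdio.h>

/* Hypothetical definition: the question does not show the real macro.
 * If cond is true, log the message and jump to the local "error" label. */
#define CHECK_ERR_GOTO(cond, msg)                                      \
    do {                                                               \
        if (cond) {                                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, (msg)); \
            goto error;                                                \
        }                                                              \
    } while (0)

/* Hypothetical helper showing the control flow: returns 1 when the
 * checked condition was true, 0 otherwise. */
int check_demo(int fail)
{
    CHECK_ERR_GOTO(fail != 0, "demo failure");
    return 0;
error:
    return 1;
}
```

With a definition like this, every function that uses the macro needs its own error label, as in the skeleton above.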

What could explain a timeout in pthread_timedjoin_np? Could another piece of code be corrupting my thread id? Is there a design problem in my skeleton?

ArthurLambert
  • What is the x manager loop? It may be blocking for more than 5 seconds, unless we know it is not. – anand Mar 04 '19 at 09:52
  • I have plenty of managers with different pieces of code. Most of the time, the manager loop is waiting on a gpio with select, reading/writing data, or sleeping. All the blocking calls are cancellation points, so the pthread join should almost always return instantly, right? – ArthurLambert Mar 04 '19 at 10:21
  • Blocking calls don't mean the thread will exit or get cancelled. If the thread is waiting, then the join call will also block until the thread exits. – anand Mar 04 '19 at 12:59
  • I am using PTHREAD_CANCEL_DEFERRED threads. If my thread is doing a sleep (999999), the cancel and join will clean up the thread instantly. – ArthurLambert Mar 04 '19 at 13:25
  • Cancellation points don't include the select call. Would it be possible to use gdb (threads are supported)? A gdb server can run on the target. With gdb it is possible to check where the thread is executing. – anand Mar 05 '19 at 04:36
  • Most of my managers wait for an event on a gpio with select, and there is no issue cancelling/joining them. Moreover, select is in the list of cancellation points: http://man7.org/linux/man-pages/man7/pthreads.7.html – ArthurLambert Mar 05 '19 at 09:58
  • One man page link on the internet shows that select is not in the list. A good reference is the man command on the system. Gdb can be helpful. – anand Mar 06 '19 at 02:34
  • The problem is that I am not able to reproduce the issue deterministically :( I can reproduce it only after 2 or 3 hours of stress testing. I added the thread id to the logs to check for possible corruption of the thread id, but that is not it. Everything seems fine. – ArthurLambert Mar 06 '19 at 08:38
  • @ArthurLambert Can you reproduce the error after e.g. several hours of stress test on your skeleton version as well? Can you post a complete skeleton here or e.g. on github? Also post details on the Linux kernel version, pthreads lib version and version/info on other relevant libs. Are you using glibc or e.g. uClibc? – Erik Alapää Mar 07 '19 at 12:32
  • I never tried putting the skeleton in a standalone binary to run my stress test. Unfortunately I cannot share the code. I am using a custom Linux based on 4.1.15 on an ARM target, with uclibc-ng 1.0.27. My firmware is built with Buildroot. I will update my whole Buildroot next week to see whether that fixes the issue... – ArthurLambert Mar 07 '19 at 17:29
  • At least there is no big mistake in the way I use the pthread library in my skeleton? All my threads use sleep/read/write very often. I never used explicit cancellation points. – ArthurLambert Mar 07 '19 at 17:31
  • Have you ruled out the question of select not being a cancellation point in uclibc? Or maybe some weird cancellation behavior with uclibc select and gpio? – Erik Alapää Mar 08 '19 at 10:50
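
One way to settle the question raised in these comments is a small standalone check run on the target itself. The sketch below is an assumption-laden illustration, not code from the thread: the function name select_is_cancellation_point and the 5 second polling window are made up, and it deliberately avoids pthread_timedjoin_np so it does not depend on GNU extensions. A cleanup handler records whether the cancellation request was acted on while the thread was blocked in select.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

static atomic_int cancelled_flag;

/* Runs only if the thread is cancelled between cleanup push and pop. */
static void on_cancel(void *arg)
{
    (void)arg;
    atomic_store(&cancelled_flag, 1);
}

static void *block_in_select(void *arg)
{
    (void)arg;
    pthread_cleanup_push(on_cancel, NULL);
    /* No descriptors and no timeout: blocks until cancelled or interrupted. */
    select(0, NULL, NULL, NULL, NULL);
    pthread_cleanup_pop(0);
    return NULL;
}

/* Returns 1 if the thread was cancelled while blocked in select(),
 * 0 if it was still blocked after about 5 seconds, -1 on setup error. */
int select_is_cancellation_point(void)
{
    pthread_t t;
    struct timespec step = { 0, 100 * 1000 * 1000 };  /* 100 ms */
    int i;

    atomic_store(&cancelled_flag, 0);
    if (pthread_create(&t, NULL, block_in_select, NULL) != 0)
        return -1;
    sleep(1);  /* give the thread time to reach select() */
    if (pthread_cancel(t) != 0)
        return -1;

    for (i = 0; i < 50; i++) {
        if (atomic_load(&cancelled_flag)) {
            pthread_join(t, NULL);
            return 1;
        }
        nanosleep(&step, NULL);
    }
    return 0;  /* thread is still blocked; it is deliberately left running */
}
```

Calling this from a small main and printing the result on the uclibc-ng target would show whether select behaves as a cancellation point there. A result of 0 would point at the libc rather than at the skeleton.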

1 Answer


You can sidestep the problem by putting a variable in the context structure indicating you want the background thread to stop, setting that variable in your main thread before calling join, and checking that variable periodically in the background thread, exiting the while(1) loop if it's true. If you have any blocking calls that sleep forever, you can either have them time out and loop them with while(!want_to_stop) or, for select loops, add a file descriptor you can activate from the main thread when you want to stop (an eventfd or pipe).
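
A minimal sketch of this approach, assuming Linux (for eventfd) and C11 atomics. The names worker_ctx, worker_start, worker_stop and run_stop_demo are hypothetical and do not come from the question; a real manager would add its own descriptors (for example the gpio fd) to the select set.

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical context, loosely modelled on the question's structure. */
struct worker_ctx {
    pthread_t id;
    atomic_int want_to_stop;  /* set by the main thread, read by the worker */
    int stop_fd;              /* eventfd that wakes the worker out of select() */
};

static void *worker_loop(void *arg)
{
    struct worker_ctx *ctx = arg;

    while (!atomic_load(&ctx->want_to_stop)) {
        fd_set rfds;

        FD_ZERO(&rfds);
        FD_SET(ctx->stop_fd, &rfds);
        /* A real manager would also FD_SET its own descriptors here. */

        if (select(ctx->stop_fd + 1, &rfds, NULL, NULL, NULL) < 0) {
            if (errno == EINTR)
                continue;     /* interrupted: re-check the flag and retry */
            break;            /* unexpected error: leave the loop */
        }
        if (FD_ISSET(ctx->stop_fd, &rfds))
            break;            /* the main thread asked us to stop */
    }
    return NULL;              /* plain return, no cancellation involved */
}

static int worker_start(struct worker_ctx *ctx)
{
    atomic_init(&ctx->want_to_stop, 0);
    ctx->stop_fd = eventfd(0, 0);
    if (ctx->stop_fd < 0)
        return -1;
    if (pthread_create(&ctx->id, NULL, worker_loop, ctx) != 0) {
        close(ctx->stop_fd);
        return -1;
    }
    return 0;
}

static int worker_stop(struct worker_ctx *ctx)
{
    uint64_t one = 1;
    int rc;

    atomic_store(&ctx->want_to_stop, 1);
    /* Makes stop_fd readable, so a select() in progress returns. */
    if (write(ctx->stop_fd, &one, sizeof one) != (ssize_t)sizeof one)
        return -1;
    rc = pthread_join(ctx->id, NULL);
    close(ctx->stop_fd);
    return rc == 0 ? 0 : -1;
}

/* Starts a worker and stops it again; returns 0 on success. */
int run_stop_demo(void)
{
    struct worker_ctx ctx;

    if (worker_start(&ctx) != 0)
        return -1;
    return worker_stop(&ctx);
}
```

Because the eventfd stays readable until it is read, the stop request is not lost even if it is written before the worker reaches select. The worker exits through a normal return, so the join no longer depends on cancellation points at all.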

Corey Mutter