3

I created 2 Linux kernel threads in my loadable module and I bind them to separate CPU cores running on a dual core Android device. After I run this few times, I noticed that the device reboots with a HW watchdog timer reset. I hit the issue consistently. What could be causing the deadlock?

Basically, what i need to do is, make sure both the threads run do_something() at the same time on different cores without anybody stealing the cpu cycles(i.e. interrupts are disabled). I am using a spinlock and a volatile variable for this. I also have a semaphore for parent thread to wait on child thread.

#define CPU_COUNT 2

/* Globals */
spinlock_t lock;
struct semaphore sem;
volatile unsigned long count;

/* Thread util function for binding the thread to CPU*/
struct task_struct* thread_init(kthread_fn fn, void* data, int cpu)
{
    struct task_struct *ts;

    ts=kthread_create(fn, data, "per_cpu_thread");
    kthread_bind(ts, cpu);
    if (!IS_ERR(ts)) {
        wake_up_process(ts);
    }
    else {
        ERR("Failed to bind thread to CPU %d\n", cpu);
    }
    return ts;
}

/* Sync both threads */
void thread_sync()
{   
    spin_lock(&lock);
    ++count;
    spin_unlock(&lock); 

    while (count != CPU_COUNT);
}

void do_something()
{
}

/* Child thread */
int per_cpu_thread_fn(void* data)
{
    int i = 0;
    unsigned long flags = 0;
    int cpu = smp_processor_id();

    DBG("per_cpu_thread entering (cpu:%d)...\n", cpu);

    /* Disable local interrupts */
    local_irq_save(flags);

    /* sync threads */
    thread_sync();

    /* Do something */
    do_something();

    /* Enable interrupts */
    local_irq_restore(flags);

    /* Notify parent about exit */
    up(&sem);
    DBG("per_cpu_thread exiting (cpu:%d)...\n", cpu);
    return value;
}

/* Main thread */
int main_thread()
{
    int cpuB;
    int cpu = smp_processor_id();
    unsigned long flags = 0;

    DBG("main thread running (cpu:%d)...\n", cpu);

    /* Init globals*/
    sema_init(&sem, 0);
    spin_lock_init(&lock);
    count = 0;

    /* Launch child thread and bind to the other CPU core */
    if (cpu == 0) cpuB = 1; else cpuB = 0;        
    thread_init(per_cpu_thread_fn, NULL, cpuB);

    /* Disable local interrupts */
    local_irq_save(flags);

    /* thread sync */
    thread_sync();

    /* Do something here */
    do_something();

    /* Enable interrupts */
    local_irq_restore(flags);

    /* Wait for child to join */
    DBG("main thread waiting for all child threads to finish ...\n");
    down_interruptible(&sem);
}
Gupta
  • 33
  • 1
  • 4

1 Answers1

0

I'm not sure, this is a real reason, but your code contains some serious errors.

First in while (count != CPU_COUNT);. You must not read shared variable without holding a lock, unless read is atomic. With count it isn't guaranteed to be.

You must protect read of count with lock. You can replace your while-loop with following:

unsigned long local_count;
do {
    spin_lock(&lock);
    local_count = count;
    spin_unlock(&lock);
} while (local_count != CPU_COUNT);

Alternatively, you could use atomic types. Notice absence of locking

atomic_t count = ATOMIC_INIT(0);

...

void thread_sync() {
    atomic_inc(&count);
    while (atomic_read(&count) != CPU_COUNT);
}

Second problem with interrupts. I think, you don't understand what you are doing.

local_irq_save() saves and disables interrupts. Then, you disable interrupts again with local_irq_disable(). After some work, you restore previous state with local_irq_restore(), and enable interrupts with local_irq_enable(). This enabling is totally wrong. You enable interrupts, regardless of theirs previous state.

Third problem. If main thread isn't binded to a cpu, you should not use smp_processor_id() unless you are sure that kernel will not reschedule right after you get a cpu number. It's better to use get_cpu(), which disables kernel preemption and then returns cpu id. When done, call put_cpu().

But, when you call get_cpu(), this is a bug to create and run other threads. That's why you should set affinity of main thread.

Fourth. local_irq_save() and local_irq_restore() macros that takes a variable, not a pointer to unsigned long. (I've got an error and some warnings passing pointers. I wonder how did you compile your code). Remove referencing

The final code is available here: http://pastebin.com/Ven6wqWf

Alexey Shmalko
  • 3,678
  • 1
  • 19
  • 35
  • Thanks Rasen for your reply. I fixed the interrupt calls but still see the problem. I cannot move while(count != CPU_COUNT) inside spin_lock() since that will deadlock immediately. Do you recommend any other way? My requirement is that both thread should start executing do_something() about the same time. – Gupta Aug 04 '13 at 21:14
  • You must protect read of `count` with lock. I've edited my post to show how to do this. – Alexey Shmalko Aug 04 '13 at 22:39
  • Thanks once again for the pointers. This does not solve my problem yet. I am seeing that one of the threads is spinning forever while the other thread never increments the counter. Do you see any issue with the way thread is created? – Gupta Aug 05 '13 at 19:23
  • I've wrote simplified version to test syncing and noticed some issues. Expanded my answer. BTW, when fix issues, syncing works fine for me. – Alexey Shmalko Aug 05 '13 at 21:36
  • Really appreciate your help on this Rasen. I originally had get_cpu(), I now put it back. Regarding local_irq_save/restore() you are right, they don't take pointer type. This was a typo on my end when i pasted the code here. I edited my question to this effect. Also, I run this in a loop for 100 times(I call IOCTL from user space) and it works fine for more than 90 times and eventually main thread ends up spinning on the counter. Deadlock does not happen immediately. – Gupta Aug 05 '13 at 22:04
  • I've solved this problem. You can consult with my code here http://pastebin.com/Ven6wqWf The last problem was with affinity of main thread. If you don't set it, you should use `get_cpu()`. But with `get_cpu` you are in atomic operation and cannot create threads. I've solved it with explicitly seting affinity of main thread. – Alexey Shmalko Aug 09 '13 at 10:38
  • Thanks Rasen for taking time and helping with this. Really appreciate it. – Gupta Oct 29 '13 at 22:44