Implementing Thread Local Storage in Software

Question

We are porting an embedded application from Windows CE to a different system. The current processor is an STM32F4. Our current codebase heavily uses TLS. The new prototype is running KEIL CMSIS RTOS which has very reduced functionality.

On http://www.keil.com/support/man/docs/armcc/armcc_chr1359124216560.htm it says that thread local storage is supported since 5.04. Right now we are using 5.04. The problem is that when linking our program with a variable definition of __thread int a; the linker cannot find __aeabi_read_tp which makes sense to me.

My question is: is it enough to implement __aeabi_read_tp ourselves, or is there more to it?

If it simply is not possible for us: Is there a way to implement TLS only in software? Let's not talk about performance there for now.

EDIT: I tried implementing __aeabi_read_tp by looking at old FreeBSD source code and other sources. While the function is mostly implemented in assembly, I found a version in C which boils down to this:

extern "C"
{
    extern osThreadId svcThreadGetId(void);
    void *__aeabi_read_tp()
    {
        return (void*)svcThreadGetId();
    }
}

What this basically does is give me the ID (void*) of my currently executing thread. If I understand correctly that is what we want. Can this possibly work?

I think that your proposed solution can possibly work, but in my opinion the symbol __aeabi_read_tp should be provided by either the standard C library or the compiler runtime library. — smbear, Jul 09 '15 at 10:56
You are basically right, but the compiler and the RTOS are not delivered together. I have also posted this question to the KEIL developers. I'm excited to see if it works. — clambake, Jul 09 '15 at 11:28

smbear · Answer 1 · 2015-07-09T08:48:48.880

Setting performance aside and without going into CMSIS RTOS specifics (which I am not familiar with), you can allocate the space your variables need yourself, either on the heap or as a static or global variable. I would suggest an array of structures. Then, when you create a thread, pass a pointer to the next unused structure to the thread function.

With a static or global array, it helps to know how many threads run in parallel, so that you can bound the size of the preallocated memory.

EDIT: Added sample of TLS implementation based on pthreads:

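#include <pthread.h>

#define MAX_PARALLEL_THREADS 10

/* Per-thread data. The field below is only a placeholder;
   replace it with the real per-thread state. */
struct tls_data {
    int placeholder;
};

static pthread_t threads[MAX_PARALLEL_THREADS];
static struct tls_data tls_data[MAX_PARALLEL_THREADS];
static int tls_data_free_index = 0;

static void *worker_thread(void *arg) {
    /* Must not be static: every thread needs its own pointer. */
    struct tls_data *data = (struct tls_data *) arg;

    /* Code omitted. */
    (void) data;
    return NULL;
}

/* Not thread-safe: call from one thread only, or guard the index with a mutex. */
static int spawn_thread(void) {
    if (tls_data_free_index >= MAX_PARALLEL_THREADS) {
        /* Consider increasing MAX_PARALLEL_THREADS */
        return -1;
    }

    /* Prepare thread data - code omitted. */

    if (pthread_create(&threads[tls_data_free_index], NULL, worker_thread,
                       &tls_data[tls_data_free_index]) != 0) {
        return -1;
    }
    tls_data_free_index++;  /* The slot is now in use. */
    return 0;
}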

score 1 · Answer 2 · edited May 23 '17 at 10:26

The not-so-impressive solution is a std::map<threadID, T>. It needs to be protected by a mutex so that new threads can add their entries safely.
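
A minimal sketch of that map-based idea, assuming standard C++11 threading is available. The class name SoftwareTls and its members are hypothetical, chosen only for illustration. On CMSIS RTOS the key would more likely be the osThreadId returned by osThreadGetId() and the lock an RTOS mutex; that substitution is an assumption and is not shown here.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>

// Hypothetical helper: one T per thread, keyed by the calling thread's id.
template <typename T>
class SoftwareTls {
public:
    // Returns the calling thread's instance, value-initializing it on first use.
    // The reference stays valid after the lock is released because std::map
    // never relocates existing elements when other elements are inserted.
    T& local() {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_[std::this_thread::get_id()];
    }

    // Call from a thread before it exits so that its slot does not leak.
    void release() {
        std::lock_guard<std::mutex> lock(mutex_);
        slots_.erase(std::this_thread::get_id());
    }

    // Number of threads that currently own a slot.
    std::size_t size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_.size();
    }

private:
    std::mutex mutex_;
    std::map<std::thread::id, T> slots_;
};
```

Every access takes the lock and does a map lookup, so this is far slower than compiler-supported TLS, but it needs nothing from the toolchain beyond a thread id and a mutex.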

For something more convoluted, see this idea

answered Jul 09 '15 at 08:30

MSalters

"I offer this horrifying nightmare creation" I love that! :) – clambake Jul 09 '15 at 08:41

score 1 · Answer 3 · answered Mar 17 '18 at 02:10

I believe this is possible, but probably tricky.

Here's a paper describing how __thread or thread_local behaves in ELF images (though it doesn't cover the ARM architecture or AEABI):

https://www.akkadia.org/drepper/tls.pdf

The executive summary is:

The linker creates .tbss and/or .tdata sections in the resulting executable to provide a prototype image of the thread local data needed for each thread.
At runtime, each thread control block (TCB) has a pointer to a dynamic thread-local vector table (dtv in the paper) that contains the thread-local storage for that thread. It is lazily allocated and initialized the first time a thread attempts to access a thread-local variable. (presumably by __aeabi_read_tp())
Initialization copies the prototype .tdata image and memsets the .tbss image into the allocated storage.
When source code accesses thread-local variables, the compiler generates code to read the thread pointer from __aeabi_read_tp(), and to do all the appropriate indirection to get at the storage for that thread-local variable.
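
The initialization step in the summary above can be sketched roughly as follows. This is only an illustration: TlsTemplate and make_tls_block are hypothetical names, the section bounds are passed in explicitly instead of coming from linker-defined symbols, and alignment requirements are ignored.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical description of the TLS prototype image. In a real toolchain
// the linker provides these bounds; here the caller supplies them.
struct TlsTemplate {
    const unsigned char* tdata;  // initialized prototype data (.tdata)
    std::size_t tdata_size;      // size of the .tdata image in bytes
    std::size_t tbss_size;       // size of the zero-initialized part (.tbss)
};

// Builds one thread's private TLS block: copy .tdata, then zero .tbss.
std::vector<unsigned char> make_tls_block(const TlsTemplate& tpl) {
    std::vector<unsigned char> block(tpl.tdata_size + tpl.tbss_size);
    if (tpl.tdata_size != 0) {
        std::memcpy(block.data(), tpl.tdata, tpl.tdata_size);
    }
    if (tpl.tbss_size != 0) {
        // The vector is already zeroed; the memset is kept to mirror the
        // step described in the summary.
        std::memset(block.data() + tpl.tdata_size, 0, tpl.tbss_size);
    }
    return block;
}
```

Each thread gets its own copy, so a write through one thread's block never shows up in another thread's block.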

The compiler and linker are doing all the work you'd expect them to, but you need to initialize and return a "thread pointer" that is structured and filled out the way the compiler expects, because the generated instructions follow those hops directly.

There are a few ways that TLS variables are accessed, as mentioned in this paper, which, again, may or may not totally apply to your compiler and architecture:

http://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt

But, the problems are roughly the same. When you have runtime-loaded libraries that may bring their own .tbss and .tdata sections, it gets more complicated. You have to expand the thread-local storage for any thread that suddenly tries to access a variable introduced by a library loaded after the storage for that thread was initialized. The compiler has to generate different access code depending on where the TLS variable is declared. You'd need to handle and test all the cases you would want to support.
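
To make the "expand the storage later" problem concrete, here is a rough sketch of a per-thread table that allocates a module's block lazily on first access. TlsModule, ThreadDtv and get_addr are hypothetical names. The sketch assumes each module's image_size is no larger than its total_size, and it does not reproduce the data layout that any particular compiler expects.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical per-module TLS description (one entry per executable or library).
struct TlsModule {
    const unsigned char* image;  // .tdata prototype, may be null if image_size is 0
    std::size_t image_size;      // bytes to copy from the prototype
    std::size_t total_size;      // .tdata plus .tbss, in bytes
};

// Hypothetical per-thread table: one lazily allocated block per module.
class ThreadDtv {
public:
    // Returns the address of (module_id, offset) for this thread, or nullptr
    // if the module or offset is out of range.
    void* get_addr(const std::vector<TlsModule>& modules,
                   std::size_t module_id, std::size_t offset) {
        if (module_id >= modules.size()) return nullptr;
        const TlsModule& m = modules[module_id];
        if (offset >= m.total_size) return nullptr;

        // Grow the table if the module was loaded after this thread started.
        if (module_id >= blocks_.size()) blocks_.resize(module_id + 1);

        std::vector<unsigned char>& block = blocks_[module_id];
        if (block.empty()) {
            block.assign(m.total_size, 0);  // zero fill covers the .tbss part
            if (m.image_size != 0) {
                std::memcpy(block.data(), m.image, m.image_size);  // .tdata part
            }
        }
        return block.data() + offset;
    }

private:
    std::vector<std::vector<unsigned char>> blocks_;
};
```

Growing the outer table moves the inner vectors but not their heap buffers, so addresses handed out earlier stay valid.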

It's years later, so you probably already solved or didn't solve your problem. In this case, it is (was) probably easiest to use your OS's TLS API directly.
