I was hoping to use the C++11 thread_local
keyword for a per-thread boolean flag that is going to be accessed very frequently.
However, most compilers seem to implement thread-local storage with a table that maps integer IDs (slots) to the variable's address for the current thread. This lookup would happen inside a performance-critical code path, so I have some concerns about its performance.
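To make the usage concrete, here is roughly the pattern I have in mind (a minimal sketch; the names are just placeholders):

    // Per-thread flag that gets read and written in a hot path.
    thread_local bool in_flight = false;

    // Placeholder for the performance-critical function.
    void process()
    {
        // Ideally these accesses would cost the same as any ordinary
        // memory access, not a slot lookup through a TLS table.
        if (!in_flight) {
            in_flight = true;
            // ... hot-path work ...
            in_flight = false;
        }
    }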
The way I would have expected thread-local storage to be implemented is by reserving a virtual address range that is backed by different physical pages for each thread. That way, accessing the flag would cost the same as any other memory access, since the MMU takes care of the mapping.
Why do none of the mainstream compilers take advantage of page table mappings in this way?
I suppose I can implement my own "thread-specific page" with mmap
on Linux and VirtualAlloc
on Win32, but this seems like a pretty common use case. If anyone knows of existing or better solutions, please point me to them.
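For concreteness, this is roughly what I have in mind on the Linux side (an untested sketch; note that the page address still has to live somewhere per thread, here just a plain thread_local pointer):

    #include <sys/mman.h>

    // Each thread lazily maps its own private page and caches the address.
    thread_local void* my_page = nullptr;

    void* get_thread_page()
    {
        if (my_page == nullptr) {
            void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p != MAP_FAILED)
                my_page = p;
        }
        return my_page;
    }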
I've also considered storing an std::atomic<std::thread::id>
inside each object to represent the active thread, but profiling shows that the check for std::this_thread::get_id() == active_thread
is quite expensive.
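The check I profiled looks roughly like this (the struct and member names are placeholders):

    #include <atomic>
    #include <thread>

    struct Widget  // hypothetical object guarded by the flag
    {
        std::atomic<std::thread::id> active_thread{std::thread::id{}};

        bool is_active_on_this_thread() const
        {
            // Both this_thread::get_id() and the atomic load show up in profiles.
            return std::this_thread::get_id() ==
                   active_thread.load(std::memory_order_relaxed);
        }
    };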