First, rather than using a lock, use atomic decrements and compare-and-exchange to manipulate your counter. The syntax for this varies with your compiler; in GCC you might do something like:
long remaining_cache_slots;
void release() {
__sync_add_and_fetch(&remaining_cache_slots, 1);
}
// Returns false if we've hit our cache limit
bool acquire() {
long prev_value, new_value;
do {
prev_value = remaining_cache_slots;
if (prev_value <= 0) return false;
new_value = prev_value - 1;
} while(!__sync_bool_compare_and_swap(&remaining_cache_slots, prev_value, new_value));
return true;
}
This should help reduce the window for contention. However, you'll still be bouncing that cache line all over the place, which at a high request rate can severely hurt your performance.
If you're willing to accept a certain amount of waste (ie, allowing the number of cached results - or rather, pending responses - to go slightly below the limit), you have some other options. One is to make the cache thread-local (if possible in your design). Another is to have each thread reserve a pool of 'cache tokens' to use.
What I mean by reserving a pool of cache tokens is that each thread can reserve ahead of time the right to insert N entries into the cache. When that thread removes an entry from the cache it adds it to its set of tokens; if it runs out of tokens, it tries to get them from a global pool, and if it has too many, it puts some of them back. The code might look a bit like this:
long global_cache_token_pool;
__thread long thread_local_token_pool = 0;
// Release 10 tokens to the global pool when we go over 20
// The maximum waste for this scheme is 20 * nthreads
#define THREAD_TOKEN_POOL_HIGHWATER 20
#define THREAD_TOKEN_POOL_RELEASECT 10
// If we run out, acquire 5 tokens from the global pool
#define THREAD_TOKEN_POOL_ACQUIRECT 5
void release() {
thread_local_token_pool++;
if (thread_local_token_pool > THREAD_TOKEN_POOL_HIGHWATER) {
thread_local_token_pool -= THREAD_TOKEN_POOL_RELEASECT;
__sync_fetch_and_add(&global_token_pool, THREAD_TOKEN_POOL_RELEASECT);
}
}
bool acquire() {
if (thread_local_token_pool > 0) {
thread_local_token_pool--;
return true;
}
long prev_val, new_val, acquired;
do {
prev_val = global_token_pool;
acquired = std::min(THREAD_TOKEN_POOL_ACQUIRECT, prev_val);
if (acquired <= 0) return false;
new_val = prev_val - acquired;
} while (!__sync_bool_compare_and_swap(&remaining_cache_slots, prev_value, new_value));
thread_local_token_pool = acquired - 1;
return true;
}
Batching up requests like this reduces the frequency at which threads access shared data, and thus the amount of contention and cache churn. However, as mentioned before, it makes your limit a bit less precise, and so requires careful tuning to get the right balance.