Limiting object allocation over multiple threads

Question

I have an application which retrieves and caches the results of a clients query and sends the results out to a client from a cache.

I have a limit on the number of items which may be cached at any one time and keeping track of this limit has has drastically reduced the applications performance when processing a large number of concurrent requests. Is there a better way to solve this problem without locking so often which may improve performance?

Edit: I've gone with the CAS approach and it seems to work pretty well.

I'm not current on c++ threading, so I'm making this a comment instead of an answer. In c#, using Interlocked could avoid at least some of the locks you use (changing SendResults being the easiest). "Atomic" and "Compare and Swap" are key terms for this topic. See this for some links regarding c++: http://stackoverflow.com/questions/54188/are-c-reads-and-writes-of-an-int-atomic and http://stackoverflow.com/questions/930897/c-atomic-operations-for-lock-free-structures — hatchet - done with SOverflow, Jun 23 '12 at 20:30
See also [a reference](http://en.cppreference.com/w/cpp/atomic) of the new C++11 atomic library, and [compiler support](http://wiki.apache.org/stdcxx/C%2B%2B0xCompilerSupport) — dyp, Jun 23 '12 at 20:36
Are you sure the performance degradation is not caused by the new caching limit? Have you tried with the cache limit configured to be very high? — Emile Cormier, Jun 23 '12 at 20:50

score 3 · Accepted Answer · answered Jun 24 '12 at 00:33

First, rather than using a lock, use atomic decrements and compare-and-exchange to manipulate your counter. The syntax for this varies with your compiler; in GCC you might do something like:

long remaining_cache_slots;

void release() {
  __sync_add_and_fetch(&remaining_cache_slots, 1);
}

// Returns false if we've hit our cache limit
bool acquire() {
  long prev_value, new_value;
  do {
    prev_value = remaining_cache_slots;
    if (prev_value <= 0) return false;
    new_value = prev_value - 1;
  } while(!__sync_bool_compare_and_swap(&remaining_cache_slots, prev_value, new_value));
  return true;
}

This should help reduce the window for contention. However, you'll still be bouncing that cache line all over the place, which at a high request rate can severely hurt your performance.

If you're willing to accept a certain amount of waste (ie, allowing the number of cached results - or rather, pending responses - to go slightly below the limit), you have some other options. One is to make the cache thread-local (if possible in your design). Another is to have each thread reserve a pool of 'cache tokens' to use.

What I mean by reserving a pool of cache tokens is that each thread can reserve ahead of time the right to insert N entries into the cache. When that thread removes an entry from the cache it adds it to its set of tokens; if it runs out of tokens, it tries to get them from a global pool, and if it has too many, it puts some of them back. The code might look a bit like this:

long global_cache_token_pool;
__thread long thread_local_token_pool = 0;

// Release 10 tokens to the global pool when we go over 20
// The maximum waste for this scheme is 20 * nthreads
#define THREAD_TOKEN_POOL_HIGHWATER 20
#define THREAD_TOKEN_POOL_RELEASECT 10

// If we run out, acquire 5 tokens from the global pool
#define THREAD_TOKEN_POOL_ACQUIRECT 5

void release() {
  thread_local_token_pool++;

  if (thread_local_token_pool > THREAD_TOKEN_POOL_HIGHWATER) {
    thread_local_token_pool -= THREAD_TOKEN_POOL_RELEASECT;
    __sync_fetch_and_add(&global_token_pool, THREAD_TOKEN_POOL_RELEASECT);
  }
}

bool acquire() {
  if (thread_local_token_pool > 0) {
    thread_local_token_pool--;
    return true;
  }

  long prev_val, new_val, acquired;
  do {
    prev_val = global_token_pool;
    acquired = std::min(THREAD_TOKEN_POOL_ACQUIRECT, prev_val);
    if (acquired <= 0) return false;

    new_val = prev_val - acquired;
  } while (!__sync_bool_compare_and_swap(&remaining_cache_slots, prev_value, new_value));

  thread_local_token_pool = acquired - 1;

  return true;
}

Batching up requests like this reduces the frequency at which threads access shared data, and thus the amount of contention and cache churn. However, as mentioned before, it makes your limit a bit less precise, and so requires careful tuning to get the right balance.

Atomic is the way to go in this case. +1 – Jun 24 '12 at 00:36 — , Jun 24 '12 at 00:36

Emile Cormier · Answer 2 · 2012-06-23T21:13:40.573

1

In SendResults, try updating totalResultsCached only once after you process the results. This will minimize the time spent acquiring/releasing the lock.

void SendResults( int resultsToSend, Request *request )
{
    for (int i=0; i<resultsToSend; ++i)
    {
        send(request.remove())
    }

    lock totalResultsCached 
    totalResultsCached -= resultsToSend;
    unlock totalResultsCached 
}

If resultsToSend is typically 1, then my suggestion will not make much of a difference.

Also, after hitting the cache limit, some extra requests may be dropped in ResultCallback, because SendResults is not updating totalResultsCached immediately after sending each request.

edited Jun 23 '12 at 21:13

answered Jun 23 '12 at 20:58

Emile Cormier

28,391
15
94
122

In your code above, if resultsToSend is >= 0 upon entry, won't it always have a value of 0 at the point you execute totalResultsCached -= resultsToSend? – hatchet - done with SOverflow Jun 23 '12 at 21:03
That could be more efficient. But it depends on how his solution is architected. If the two methods in his question can run on different threads, in his original code one method could be putting results into the cache WHILE results are being sent out. In your code it would be more "chunky". If totalResultsCached was at the max, more results wouldn't get put in until SendResults had sent all the results it had to process. – hatchet - done with SOverflow Jun 23 '12 at 21:10

Limiting object allocation over multiple threads

2 Answers2