
I need to run unsafe native code in a sandbox process, and I need to reduce the bottleneck of switching between processes. Both processes (controller and sandbox) share two auto-reset events and a coherent view of a mapped file (shared memory) that is used for communication.

To keep this question short, I removed initialization from the sample code, but the events are created by the controller, duplicated using DuplicateHandle, and sent to the sandbox process before any work starts.
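For reference, the omitted setup looks roughly like this (a sketch only; `hSandboxProcess` and the mapping name are placeholders for whatever your code actually uses, and error handling is elided):

```
#include <windows.h>

// Controller-side setup: create the two auto-reset events, duplicate them
// into the sandbox process, and create the shared-memory mapping. The
// duplicated handle values must then be passed to the sandbox somehow
// (e.g. on its command line).
void setupSandboxComms(HANDLE hSandboxProcess,
                       HANDLE *hNewRequest, HANDLE *hAnswer,
                       volatile int **shared)
{
    // Auto-reset (bManualReset = FALSE), initially non-signaled.
    *hNewRequest = CreateEvent(NULL, FALSE, FALSE, NULL);
    *hAnswer     = CreateEvent(NULL, FALSE, FALSE, NULL);

    HANDLE hRemoteNewRequest, hRemoteAnswer;
    DuplicateHandle(GetCurrentProcess(), *hNewRequest,
                    hSandboxProcess, &hRemoteNewRequest,
                    0, FALSE, DUPLICATE_SAME_ACCESS);
    DuplicateHandle(GetCurrentProcess(), *hAnswer,
                    hSandboxProcess, &hRemoteAnswer,
                    0, FALSE, DUPLICATE_SAME_ACCESS);

    // Named mapping both processes can open for the shared int.
    HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                    PAGE_READWRITE, 0, 4096,
                                    L"Local\\SandboxSharedMem");
    *shared = (volatile int *)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS,
                                            0, 0, 0);
}
```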

Controller source:

void inSandbox(HANDLE hNewRequest, HANDLE hAnswer, volatile int *shared) {
  int before = *shared;
  for (int i = 0; i < 100000; ++i) {
    // Notify sandbox of a new request and wait for answer.
    SignalObjectAndWait(hNewRequest, hAnswer, INFINITE, FALSE);
  }
  assert(*shared == before + 100000);
}

void inProcess(volatile int *shared) {
  int before = *shared;
  for (int i = 0; i < 100000; ++i) {
    newRequest(shared);
  }
  assert(*shared == before + 100000);
}

void newRequest(volatile int *shared) {
  // In this test, the request only increments an int.
  (*shared)++;
}

Sandbox source:

void sandboxLoop(HANDLE hNewRequest, HANDLE hAnswer, volatile int *shared) {
  // Wait for the first request from controller.
  assert(WaitForSingleObject(hNewRequest, INFINITE) == WAIT_OBJECT_0);
  for(;;) {
    // Perform request.
    newRequest(shared);
    // Notify controller and wait for next request.
    SignalObjectAndWait(hAnswer, hNewRequest, INFINITE, FALSE);
  }
}

void newRequest(volatile int *shared) {
  // In this test, the request only increments an int.
  (*shared)++;
}

Measurements:

  • inSandbox() - 550ms, ~350k context switches, 42% CPU (25% kernel, 17% user).
  • inProcess() - 20ms, ~2k context switches, 55% CPU (2% kernel, 53% user).

The machine is Windows 7 Pro on a Core 2 Duo P9700 with 8 GB of memory.

An interesting fact is that the sandbox solution uses 42% CPU vs. 55% for the in-process solution. Another noteworthy fact is that the sandbox solution performs ~350k context switches, much more than the 200k we can infer from the source code (each of the 100k round trips needs one switch to the sandbox and one back).

I need to know if there's a way to reduce the overhead of transferring control to another process. I already tried using pipes instead of events, and it was much worse. I also tried using no events at all, by making the sandbox call SuspendThread(GetCurrentThread()) and making the controller call ResumeThread(hSandboxThread) on every request, but the performance was similar to using events.

If you have a solution that uses assembly (like performing a manual context switch) or Windows Driver Kit, please let me know as well. I don't mind having to install a driver to make this faster.

I heard that Google Native Client does something similar, but I only found this documentation. If you have more information, please let me know.

fernacolo
  • 7,012
  • 5
  • 40
  • 61
  • There probably isn't any way to make this substantially faster. Can you explain what it is your two processes are doing? Perhaps there is another, more efficient approach. – 500 - Internal Server Error Jul 08 '12 at 22:12
  • 1
    Have you tried using RPC (with the lrpc endpoint type)? – reuben Jul 08 '12 at 22:26
  • This is an old application ported to Java that needs to run legacy native code. The native code is unsafe and causes JVM crashes. That's why we want it to run in a sandbox. The sandbox was implemented using a separated process, but we might well use a different solution. – fernacolo Jul 08 '12 at 23:32
  • I didn't try RPC/lrpc. Can you send some link about lrpc? – fernacolo Jul 08 '12 at 23:32
  • In the production system presumably both the sandbox and the controller will be doing a lot more work between context switches, so an overhead of 5.5 microseconds per handover might not be all that significant. – Harry Johnston Jul 09 '12 at 00:06
@HarryJohnston Unfortunately, no. Some algorithms call five or so quick legacy functions _per record_, so if you have 100k records per batch, that becomes a big performance bottleneck. A process that takes 1 second will take 3.5 seconds - it's enough for the user to notice a delay. – fernacolo Jul 09 '12 at 00:59
  • If a few frequently used functions are short enough for the overhead to be significant, i.e., only a few thousand instructions, it might be feasible to pick out just those functions and debug them. Then you could run the quick functions in-process and the slow functions in the sandbox. But in the short term I think a spinlock should do the trick for the quick functions - provided your production machines all have at least two CPUs. – Harry Johnston Jul 09 '12 at 01:32
  • Just how hard is to fix the crashing code? – Alexey Frunze Jul 09 '12 at 11:17
  • @AlexeyFrunze The crashing code is at 3rd-party DLLs. We don't have the sources. – fernacolo Jul 09 '12 at 11:48

1 Answer


The first thing to try is raising the priority of the waiting thread. This should reduce the number of extraneous context switches.
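For instance (a sketch; `hSandboxThread` stands for whatever handle you keep for the sandbox's worker thread, obtained via OpenThread or at process-creation time):

```
#include <windows.h>

// Raising the priority of the thread that will receive the signal makes
// the scheduler more likely to run it immediately, rather than some
// unrelated ready thread, cutting the extraneous context switches.
void boostSandboxThread(HANDLE hSandboxThread)
{
    SetThreadPriority(hSandboxThread, THREAD_PRIORITY_ABOVE_NORMAL);
}
```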

Alternatively, since you're on a 2-core system, using spinlocks instead of events would make your code much faster, at the cost of system performance and power consumption:

void inSandbox(volatile int *lock, volatile int *shared)
{
  int i, before = *shared;
  for (i = 0; i < 100000; ++i) {
    *lock = 1;                // post the request to the sandbox
    while (*lock != 0) { }    // spin until the sandbox clears the lock
  }
  assert(*shared == before + 100000);
}

void newRequest(volatile int *shared) {
  // In this test, the request only increments an int.
  (*shared)++;
}

void sandboxLoop(volatile int *lock, volatile int *shared)
{
  for (;;) {
    while (*lock != 1) { }    // spin until the controller posts a request
    newRequest(shared);
    *lock = 0;                // hand control back to the controller
  }
}

In this scenario, you should probably set thread affinity masks and/or lower the priority of the spinning thread so that it doesn't compete with the busy thread for CPU time.
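Concretely, something like this in each process (a sketch; the affinity masks assume a 2-core box like the P9700 and should be tuned for the actual machine):

```
#include <windows.h>

// Pin the two spinning threads to different cores so neither busy-wait
// starves the thread it is waiting for.
void configureSpinner(int isController)
{
    SetThreadAffinityMask(GetCurrentThread(), isController ? 1 : 2);
    // Optionally lower the spinner's priority so that, if the masks ever
    // overlap, the busy side still gets CPU time.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
}
```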

Ideally, you'd use a hybrid approach. When one side is going to be busy for a while, let the other side wait on an event so that other processes can get some CPU time. You could trigger the event a little ahead of time (using the spinlock to retain synchronization) so that the other thread will be ready when you are.
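The hybrid idea can be sketched portably (C11 atomics plus a POSIX mutex/condvar standing in for the shared-memory word and the Win32 events, purely to illustrate the protocol; `SPIN_LIMIT` is a tuning guess, and in the real system the two loops would live in separate processes over the file mapping):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_LIMIT 1000  /* how long to spin before sleeping (a guess) */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static atomic_int lock_word = 0;   /* 0 = answer ready, 1 = request pending */
static atomic_int shared = 0;

/* Spin briefly on the fast path; fall back to a condvar so the CPU is
 * released when the other side stays busy for a long time. */
static void hybridWaitFor(int value)
{
    for (int spins = 0; spins < SPIN_LIMIT; ++spins)
        if (atomic_load(&lock_word) == value) return;
    pthread_mutex_lock(&mtx);
    while (atomic_load(&lock_word) != value)
        pthread_cond_wait(&cv, &mtx);
    pthread_mutex_unlock(&mtx);
}

/* Publish the new state and wake any waiter that fell back to sleeping. */
static void hybridSet(int value)
{
    pthread_mutex_lock(&mtx);
    atomic_store(&lock_word, value);
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);
}

static void *sandboxLoop(void *arg)
{
    int iterations = *(int *)arg;
    for (int i = 0; i < iterations; ++i) {
        hybridWaitFor(1);              /* wait for a request */
        atomic_fetch_add(&shared, 1);  /* perform it */
        hybridSet(0);                  /* publish the answer */
    }
    return NULL;
}

int runHybrid(int iterations)
{
    pthread_t t;
    pthread_create(&t, NULL, sandboxLoop, &iterations);
    int before = atomic_load(&shared);
    for (int i = 0; i < iterations; ++i) {
        hybridSet(1);                  /* post a request */
        hybridWaitFor(0);              /* wait for the answer */
    }
    pthread_join(t, NULL);
    return atomic_load(&shared) - before;
}
```

Re-checking the flag under the mutex before sleeping is what prevents lost wakeups, since the setter always stores and broadcasts while holding the same mutex.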

Harry Johnston
  • 35,639
  • 6
  • 68
  • 158
  • Amazing, I got ~50ms! Almost in-process performance! Thanks very much! – fernacolo Jul 10 '12 at 14:41
By the way, the hybrid solution was a bit trickier to implement - I had to use InterlockedCompareExchange to avoid some race conditions. – fernacolo Jul 10 '12 at 14:41