We are completing a generic service-bus implementation in C, with clients for C#, Delphi, PL/SQL and PHP. The library works great and the bus performs very well, except when the MongoDB database is running on Windows (tested on 2008 R2, 2003 and 7) and no other "special" program is running.
Our test does the following:
- Program A inserts a message into a capped collection
- Program B tails the message queue collection and "wakes up" when a message appears, using a cursor with the awaitData param set to true
- When Program B wakes up, it prepares a message and sends a response to Program A by inserting a document into a collection specific to Program A
- Program A is already waiting on this second "response" collection and is awakened when Program B (the producer) sends the response back
- The loop ends there
Our testing program, a console application compiled with Visual Studio 2010, counts the loops and reports the performance.
We run either everything on one machine, or MongoDB on one machine and the consumer and producer together on another. We have run this on Windows 2008 R2, Windows 2003 and Windows 7. For 2008 R2 we used the special mongo build for that OS, while for 2003 and 7 we used the "legacy" 64-bit build.
On a clean OS, with no other programs running, our test performs about 32-50 roundtrips per second, which is lousy compared to the "good" results we get when everything goes full speed.
Now, here comes the strange thing:
When certain applications are started on the machine where the mongo database runs, our tests speed up to about 450 roundtrips/sec when everything runs on the same machine over loopback, or to about 300/sec when the consumer and producer run on one machine and mongodb on another, going over the network.
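To put those rates in perspective, they can be converted into average time per roundtrip. This is only arithmetic on the numbers reported above; the function names are made up for illustration and are not part of the bus code.

```c
#include <stdio.h>

/* Convert a roundtrip rate (per second) into average milliseconds per roundtrip. */
double ms_per_roundtrip(double roundtrips_per_sec) {
    return 1000.0 / roundtrips_per_sec;
}

/* Print the latency implied by each rate quoted in this post. */
void print_latencies(void) {
    const double rates[] = { 32.0, 50.0, 300.0, 450.0 };
    const int count = (int)(sizeof rates / sizeof rates[0]);
    for (int i = 0; i < count; i++) {
        printf("%6.0f roundtrips/sec = %6.2f ms per roundtrip\n",
               rates[i], ms_per_roundtrip(rates[i]));
    }
}
```

Calling print_latencies() gives 31.25 ms and 20.00 ms per roundtrip for the slow case, against 3.33 ms and 2.22 ms for the fast case, so each slow roundtrip carries roughly 17-29 ms of extra time.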
The reason we never noticed this problem consistently before is that we almost always had Visual Studio open in our development VMs, and Visual Studio is a program that acts as a "mongodb accelerator" (I know this sounds ridiculous, please don't bash me for this statement).
At first we noticed this issue "randomly", essentially when running our tests without VS open. So we tended to blame it on the underlying SAN where VMware runs, or the VM hosts, or cosmic rays, or the NSA snooping on our program. That lasted until we finally figured out the correlation between having VS open and our test results, and narrowed it down to the following:
MongoDB running on a Windows system (as a console program OR as a service), virtual or physical, versions 2008 R2, 2003 or 7, is slow at the following pattern: receiving data on a capped collection, waking up a tailing cursor, then sending a response back to the consumer on another capped collection in the same way. That is, unless you simply start a program such as Visual Studio, Delphi XE4, the Google Chrome browser or the CrystalDiskMark disk I/O testing program (other programs may speed up Mongo too). Then mongodb runs this pattern a full order of magnitude faster.
We could not find exactly what these programs have in common that may cause the issue.
At this point we are stunned by the issue. I even reviewed the MongoDB code used for tailable cursors, but didn't find anything that smells like a potential cause. The code pretty much spins for a maximum of 4 seconds waiting for data to appear; besides the suspicious "sleep" call on every loop iteration, nothing else caught my eye.
Is it possible that certain programs cause the Sleep() Windows API call to behave differently, and that this makes mongo perform these tailable cursor operations slower?
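One way to check this hypothesis would be to measure how long a short sleep really takes, once on a clean system and once with one of the "accelerator" programs open. The following is a minimal sketch under that assumption: now_ms, sleep_ms and average_sleep_ms are hypothetical helpers written for this test, not MongoDB code, and the non-Windows branch exists only so the same file can be compiled on other systems for comparison.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime and nanosleep */
#endif

#ifdef _WIN32
#include <windows.h>

/* Monotonic clock in milliseconds, based on the performance counter. */
static double now_ms(void) {
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    return (double)count.QuadPart * 1000.0 / (double)freq.QuadPart;
}

/* The call under suspicion: the Windows Sleep() API. */
static void sleep_ms(unsigned ms) {
    Sleep(ms);
}
#else
#include <time.h>

/* Monotonic clock in milliseconds. */
static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* POSIX stand-in for Sleep(), used only for comparison runs. */
static void sleep_ms(unsigned ms) {
    struct timespec req;
    req.tv_sec = ms / 1000;
    req.tv_nsec = (long)(ms % 1000) * 1000000L;
    nanosleep(&req, NULL);
}
#endif

/* Average wall-clock duration, in ms, of `iterations` sleeps of `requested_ms` each. */
double average_sleep_ms(unsigned requested_ms, int iterations) {
    double start = now_ms();
    for (int i = 0; i < iterations; i++) {
        sleep_ms(requested_ms);
    }
    return (now_ms() - start) / iterations;
}
```

Calling average_sleep_ms(1, 200) in both situations and comparing the two averages would show whether the effective sleep granularity changes when those programs are running. A large difference would point at the timer rather than at disk or network.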
We think something is indeed "slowing down", because CPU utilization also goes down, as if mongodb were literally "waiting" for something when it's running slow.
I know this pattern works fine on Unix/Linux based systems, and I tried the same codebase on a Mac with no issues, so this really smells like a Windows issue.
Has anyone else experienced a similar issue out there?