
My app consists of two threads:

  1. GUI Thread (using Qt)
  2. Simulation Thread

My reason for using two threads is to keep the GUI responsive, while letting the Sim thread spin as fast as possible.

In my GUI thread I'm rendering the entities in the sim at an FPS of 30-60; however, I want my sim to "crunch ahead" - so to speak - and queue up game state to be drawn eventually (think streaming video, you've got a buffer).

Now for each frame of the sim I render I need the corresponding simulation "State". So my sim thread looks something like:

while(1) {
    simulation.update();
    SimState* s = new SimState;
    simulation.getAgents( s->agents ); // store agents
    // store other things to SimState here..
    stateStore.enqueue(s); // stateStore is a QQueue<SimState*>
    if( /* some threshold reached */ )
        // push stateStore
}

SimState looks like:

struct SimState {
    std::vector<Agent> agents;
    //other stuff here
};

And Simulation::getAgents looks like:

void Simulation::getAgents(std::vector<Agent> &a) const
{
    // mAgents is a std::vector<Agent>
    std::vector<Agent> a_tmp(mAgents);
    a.swap(a_tmp);
}

The Agents themselves are somewhat complex classes. The members are a bunch of ints and floats and two std::vector<float>s.

With the current setup the sim can't crunch much faster than the GUI thread is drawing. I've verified that the current bottleneck is simulation.getAgents( s->agents ): even if I leave out the enqueue, the updates-per-second are still slow, and if I comment out that line I see several orders of magnitude improvement in updates/second.

So, what sorts of containers should I be using to store the simulation's state? I know there's a bunch of copying going on at the moment, but some of it is unavoidable. Should I store Agent* in the vector instead of Agent?

Note: In reality the simulation isn't in a loop, but uses Qt's QMetaObject::invokeMethod(this, "doSimUpdate", Qt::QueuedConnection); so I can use signals/slots to communicate between the threads; however, I've verified that the issue persists with a simpler while(1){} version.

Casey
  • I'd be interested to know how you solved this, even though it's likely been a while for you. I'm having a similar problem – woosah Mar 02 '15 at 03:11

1 Answer


Try re-using your SimState objects (using some kind of pool mechanism) instead of allocating them every time. After a few simulation loops, the re-used SimState objects will have vectors that have grown to the needed size, thus avoiding reallocation and saving time.

An easy way to implement a pool is to initially push a bunch of pre-allocated SimState objects onto a std::stack<SimState*>. Note that a stack is preferable to a queue, because you want to take the SimState object that is most likely to be "hot" in the cache memory (the most recently used SimState object will be at the top of the stack). Your simulation thread pops SimState objects off the stack and populates them with the computed SimState. These computed SimState objects are then pushed into a producer/consumer queue to feed the GUI thread. After being rendered by the GUI thread, they are pushed back onto the SimState stack (i.e. the "pool"). Try to avoid needless copying of SimState objects while doing all this. Work directly with the SimState object in each stage of your "pipeline".
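A minimal sketch of the pool idea described above, using C++11 std::mutex for brevity (Qt's QMutex would serve the same purpose; the names StatePool/acquire/release are illustrative, not from the question):

```cpp
#include <cassert>
#include <mutex>
#include <stack>
#include <vector>

// Hypothetical stand-in for the question's SimState.
struct SimState {
    std::vector<float> agents;  // placeholder for std::vector<Agent>
};

// Minimal thread-safe pool of reusable SimState objects.
class StatePool {
public:
    explicit StatePool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            pool_.push(new SimState);
    }
    SimState* acquire() {            // returns nullptr if the pool is depleted
        std::lock_guard<std::mutex> lock(mutex_);
        if (pool_.empty()) return nullptr;
        SimState* s = pool_.top();   // most recently used: likely cache-hot
        pool_.pop();
        return s;
    }
    void release(SimState* s) {      // GUI thread returns rendered states here
        std::lock_guard<std::mutex> lock(mutex_);
        pool_.push(s);
    }
private:
    std::stack<SimState*> pool_;
    std::mutex mutex_;
};
```

Because a released SimState keeps its vectors' capacity, the simulation thread that re-acquires it can refill them without any reallocation.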

Of course, you'll have to use the proper synchronization mechanisms in your SimState stack and queue to avoid race conditions. Qt might already have thread-safe stacks/queues. A lock-free stack/queue might speed things up if there is a lot of contention (Intel Threading Building Blocks provides such lock-free queues). Seeing that it takes on the order of 1/50 of a second to compute a SimState, I doubt that contention will be a problem.

If your SimState pool becomes depleted, then it means that your simulation thread is too "far ahead" and can afford to wait for some SimState objects to be returned to the pool. The simulation thread should block (using a condition variable) until a SimState object becomes available again in the pool. The size of your SimState pool corresponds to how many SimStates can be buffered (e.g. a pool of ~50 objects gives you a crunch-ahead time of up to ~1 second).
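The blocking behaviour could look something like this sketch, using C++11 std::condition_variable (Qt's QWaitCondition is the equivalent; all names here are illustrative):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <stack>

struct SimState { /* agents, etc. */ };

// Pool variant whose acquire() blocks when depleted, throttling the
// simulation thread once it has buffered as far ahead as allowed.
class BlockingStatePool {
public:
    explicit BlockingStatePool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) pool_.push(new SimState);
    }
    SimState* acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !pool_.empty(); });
        SimState* s = pool_.top();
        pool_.pop();
        return s;
    }
    void release(SimState* s) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pool_.push(s);
        }
        available_.notify_one();  // wake a blocked simulation thread
    }
private:
    std::stack<SimState*> pool_;
    std::mutex mutex_;
    std::condition_variable available_;
};
```

The key point is that the simulation thread never allocates past the pool size; it simply sleeps in acquire() until the GUI thread's release() makes a state available again.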

You can also try running parallel simulation threads to take advantage of multi-core processors. The Thread Pool pattern can be useful here. However, care must be taken that the computed SimStates are enqueued in the proper order. A thread-safe priority queue ordered by time-stamp might work here.
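For the ordering part, a comparator over a per-state tick number is enough (this sketch shows only the ordering; wrap the queue with a mutex as above for thread safety — SimState's tick field is assumed here, not taken from the question):

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical: each computed state carries the simulation tick it
// belongs to, so out-of-order completion can be re-sequenced.
struct SimState { long tick; };

struct EarlierTick {
    bool operator()(const SimState* a, const SimState* b) const {
        return a->tick > b->tick;  // min-heap: smallest tick on top
    }
};

using OrderedStates =
    std::priority_queue<SimState*, std::vector<SimState*>, EarlierTick>;
```

The GUI thread would pop only when top()->tick equals the next frame it expects, so no state is ever rendered out of order.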

Here's a simple diagram of the pipeline architecture I'm suggesting:

[Diagram: pipeline architecture — pool → simulation thread → producer/consumer queue → GUI thread → back to pool]

(NOTE: The pool and queue hold SimState by pointer, not by value!)

Hope this helps.


If you plan to re-use your SimState objects, then your Simulation::getAgents method will be inefficient. This is because the vector<Agent>& a parameter is likely to already have enough capacity to hold the agent list.

The way you're doing it now would throw away this already allocated vector and create a new one from scratch.

IMO, your getAgents should be:

void Simulation::getAgents(std::vector<Agent> &a) const
{
    a = mAgents;
}

Yes, you lose exception safety, but you might gain performance (especially with the reusable SimState approach).


Another idea: You could try making your Agent objects fixed-size, by using a C-style array (or boost::array) and a "count" variable instead of std::vector for Agent's float-list members. Simply make the fixed-size array big enough for any situation in your simulation. Yes, you'll waste space, but you might gain a lot of speed.
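A sketch of what that could look like (the member names and the bound kMaxSamples are made up for illustration; the question only says Agent holds ints, floats, and two std::vector<float>s):

```cpp
#include <cassert>
#include <cstddef>

// Assumed worst-case size for the former std::vector<float> members.
const std::size_t kMaxSamples = 64;

struct Agent {
    int         id;
    float       x, y;
    float       samples[kMaxSamples];  // was std::vector<float>
    std::size_t sampleCount;           // how many entries are in use
};
```

With no heap-owning members, Agent becomes trivially copyable, so copying a std::vector<Agent> is one flat block copy instead of per-element allocation and element-by-element copying.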

You can then pool your Agents using a fixed-size object allocator (such as boost::pool) and pass them around by pointer (or shared_ptr). That'll eliminate a lot of heap allocation and copying.

You can use this idea alone or in combination with the above ideas. This idea seems easier to implement than the pipeline thing above, so you might want to try it first.


Yet another idea: Instead of a thread pool for running simulation loops, you can break down the simulation into several stages and execute each stage in its own thread. Producer/consumer queues are used to exchange SimState objects between stages. For this to be effective, the different stages need to have roughly similar CPU workloads (otherwise, one stage will become the bottleneck). This is a different way to exploit parallelism.

Emile Cormier
  • Wow! Fantastic answer, I definitely hadn't considered using a pool for the SimState objects. Do you have anything to add about the `getAgents` method? Is the procedure of (1) passing a reference, filling a temp vector (3) swapping the two, the best way to do it? – Casey Feb 15 '11 at 20:40
  • See the follow-up in my answer. The temp and swap idiom is a good way to ensure exception safety, but at the (possible) cost of performance degradation. Seeing that you found that method to be a bottleneck, then sacrificing exception safety would be justified, IMO. Just make sure to profile again to make sure you actually gained performance. – Emile Cormier Feb 15 '11 at 20:56
  • Added another follow-up about making Agent fixed-size, using a pool allocator, and passing Agent around by pointer. – Emile Cormier Feb 15 '11 at 21:17
  • Thank you for invaluable advice. I did some benchmarks, and creating a Stack of SimStates with pre-sized vectors takes a significant amount of time (proving your point), much more than the copying (which was what I thought was the bottleneck). I really like the pipeline architecture, and am researching the best way to do it within Qt. Qt's queue isn't thread safe, so I'd have to implement a wrapper to do the synchronization. However b/c won't the atomic mutex operations slow the sim loop down considerably? – Casey Feb 15 '11 at 21:27
  • Typo in that last sentence. Ignore the 'b/c'. – Casey Feb 15 '11 at 21:34
  • You only need to hold the mutex while accessing the queue. Accessing a queue takes nanoseconds/microseconds (assuming the enqueued objects are merely pointers). But updating the queue only happens once every 1/50 seconds. The contention over the queue's mutex will therefore be negligible, IMO. Read up on *producer/consumer queues* if this isn't clear. Keep in mind that SimStates are passed around by pointer in the pipeline architecture I described (I goofed up the diagram). – Emile Cormier Feb 15 '11 at 21:35
  • Added another idea in my answer. – Emile Cormier Feb 15 '11 at 21:56
  • I removed the `std::vector` from the Agent (and its dependents) replacing it with a `std::tr1::array`, and already there is a substantial speedup. Looking into the `boost::pool` documentation.. I'm not sure how helpful it will be, aside from the `getAgents()` method the Agent's aren't really passed often. – Casey Feb 15 '11 at 22:05
  • I'd be curious to see any before/after figures you can report while attempting these ideas. :-) – Emile Cormier Feb 15 '11 at 22:14
  • I see now that passing `Agent` around by pointer is probably not helpful. `Simulation` needs its own set of agents independent of the ones stored in `SimState`. Hence the need to copy `Agent` by value. Eliminating `std::vector` from `Agent` sped up the copying process, probably because agents can now simply be memcpy'd instead of being copied member-by-member. – Emile Cormier Feb 15 '11 at 22:53
  • My Sim thread now pops a SimState* from a stack, loads it up with state, then puts it in a queue. Periodically it sends this Queue via Qt's signal/slots to the GUI thread. Signals use the event mechanism which is thread safe (the Queue itself isn't). The GUI thread renders the states, then sends the "empty" SimState objects back to the Sim thread (via a signal), where it ends up in a stack and the process repeats. – Casey Feb 15 '11 at 23:02
  • I'm pre-allocating 500 SimState* objects, which works pretty well. The GUI thread is always busy, though once the Sim thread runs out of state objects it slows to a crawl again as it manually allocates new SimState* objects. – Casey Feb 15 '11 at 23:03
  • Running the simulation w/out state saving and rendering I can get about 2500 ticks/sec on my dual core laptop. While the pre-allocated state objects hold out I get around 600 ticks/sec, and when they run out I get around 80 ticks/sec. The bottleneck is definitely the allocation of SimState and the resizing of its vector. The producer is out-producing the consumer. – Casey Feb 15 '11 at 23:05
  • Oh, before I implemented your suggestions and ran the renderer alongside the computation I got around 50 ticks per second. So if I allocate enough SimState* objects ahead of time I can maintain a good ticks/sec for some time, although since the simulation is producing SimState*s at such a high rate it might be unfeasible to expect there to always be enough SimStates. I need to do the math.. – Casey Feb 15 '11 at 23:23
  • Don't allocate any more `SimState`'s than what is in the original pool! Once you deplete the pool, the simulation thread should block until a SimState becomes available again in the pool. Use the same blocking mechanism that you use in your producer/consumer queues (typically, this is done with a condition variable). If you get to the point that the pool is depleted, then that means that the simulation thread has done its job well in "pre-buffering" `SimState` objects. – Emile Cormier Feb 15 '11 at 23:24
  • ...The number of initial `SimState` objects in the pool represents the amount of "pre-buffering" your simulation thread can do. If you want to have a 10 second buffer at 50fps, then that means 10*50=500 total `SimState` objects in the pool. Note that the total number of `SimState` objects currently in the pipeline is equal to the number of initial `SimState` objects in the pool. This would be easier to explain if I were beside you with a whiteboard, lol. – Emile Cormier Feb 15 '11 at 23:31
  • Also note that the simulation thread blocking on available SimState packets in the pool is a means to throttle back the simulation thread (at the point of maximum buffering) so that it does not produce faster than the consumer can consume. The initial batch of `SimState` objects are perpetually recycled within the pipeline. – Emile Cormier Feb 15 '11 at 23:40
  • Ah yes! I understood how the pool represented the pre-buffer, but when the pool ran out I was either (1) allocating new objects or (2) tick()ing anyway and not saving. I like your suggestion of blocking instead, so I don't miss any tick()s and don't incur the slowdown of allocating new objects. As for the whiteboard: it's unnecessary, you're being a great help, and your explanations are great. – Casey Feb 15 '11 at 23:45
  • The thread pool idea seems overkill at this point, seeing that your simulation thread can already out-produce the consumer thread. If you get to the point where you need more simulation horsepower, and you've got some CPU cores idling, then you can try the thread pool pattern. – Emile Cormier Feb 16 '11 at 00:52