Flocking Algorithm crashes on 200+ boids

Question

I'm working on the implementation of a flocking algorithm into a bigger system. OGRE is used for rendering, luabind is used to be able to communicate from LUA, yatta yatta, stuff that shouldn't matter too much.

I implemented the algorithm basically following reynolds' boids-model. That means, one Boid (as in "one fish in a swarm") moves according to it's neighbours in a certain radius. AS things are, the basic complexity of that is O(n²), as every boid has to check all it's flockmates if they are in range and then considers some factors to calculate the own movement.

The algorithm itself is implemented and runs smoothly. It accepts models of all different sizes, it works in 2D and 3D space, it flocks nice etc, I'm working on it for a while already.

My Problem is that the algorithm crashes on runtime as soon as I hit a kind of "barrier" in boid numbers, which is roughly 200-250 and even varies. Now, if it would get slower and slower until I'm broken to 1 fps I could understand that simply performance of the algorithm is the problem. However, for example, 199 boids work perfectly, 201 don't work at all and crash on runtime. This is highly surprising for me.

I implemented 2 classes: "Swarm" and "Boid". Swarm is used to hold pointers to all boids of a swarm but doesn't calculate much, movement happens in Boid.

swarm.h:

#ifndef __SWARM_H__
#define __SWARM_H__

#include <vector>
#include "boid.h"

namespace Core {

    class swarm {

    public:
    swarm();
    ~swarm();
    bool move(float timeSinceLastFrame);
    bool addBoid(boid *thisBoid);
    bool removeBoidByName(std::string boidName);
    bool writeAllBoidNames();
    std::vector<boid*> getFlockMates();

private:
    std::vector<boid*> flock;
    float timePassed;

};

}


#endif

boid.h:

#ifndef __BOID_H__
#define __BOID_H__

#include "Ogre.h"
#include <vector>
#include <stdlib.h>

namespace Core {

class swarm;

class boid {

    public:
        boid(Ogre::SceneNode *thisNode, Ogre::String orientation, swarm *thisSwarm);            // Constructor - direkter Node-Zugriff
        boid(Ogre::MovableObject *thisObject, Ogre::String orientation, swarm *thisSwarm);      // Constructor - Node-Zugriff über das Objekt
        ~boid();                                                                                // Destructor
        Ogre::Vector3 getBoidPosition();                                                        // gibt die derzeitige Position des Boids zurück - nötig für Cohesion und Separation
        Ogre::Vector3 getBoidVelocity();                                                        // gibt die derzeitige Geschwindigkeit und Richtung des Boids zurück - nötig für Alignement
        std::string getBoidName();                                                              // gibt den Namen eines Boids zurück
        bool move(float timeSinceLastFrame);                                                    // bewegt das Boid

    private:
        swarm *flockMates;                                                                      // pointer auf den Schwarm
        float boidSize;                                                                         // die Größe des Boids
        Ogre::Vector3 boidOrientation;                                                          // Grundlegende Orientierung des Meshes eines Boids
        Ogre::SceneNode *boidNode;                                                              // Pointer auf die SceneNode des Boids - das Objekt, das tatsächlich bewegt wird
        Ogre::Vector3 velocity;                                                                 // derzeitige, bzw. letzte Geschwindigkeit

};
}

#endif

As you can see, I'm using a vector of pointers to boid objects in swarm. On runtime, swarm::move() is called, which iterates through the vector and calls boid::move() for every boid.

bool swarm::move(float timeSinceLastFrame) {
    std::vector<boid*>::iterator iter;
    for ( iter = flock.begin(); iter != flock.end(); iter++ ) {
        (*iter)->move(timeSinceLastFrame);
    }
    return true;
}

boid::move is very complex as it calculates the movement based on a lot of things. I will post the points that - imo - actually matter for here, not every single multiplication, as I don't want to bore you with unneeded stuff. Edited: Okay, now you got most of the code here.

bool boid::move(float timeSinceLastFrame) {

    Ogre::Vector3 cohesion = Ogre::Vector3(0, 0, 0);
    Ogre::Vector3 alignement = Ogre::Vector3(0, 0, 0);
    Ogre::Vector3 separation = Ogre::Vector3(0, 0, 0);

    int cohesionCount = 0;
    int alignementCount = 0;
    int separationCount = 0;

    std::vector<boid*>::iterator iter;
    std::vector<boid*> temp = flockMates->getFlockMates();

        for ( iter = temp.begin(); iter != temp.end(); iter++ ) {

        if ( (*iter) != this ) {

            Ogre::Vector3 p1 = boidNode->getPosition();
            Ogre::Vector3 p2 = (*iter)->getBoidPosition();
            Ogre::Real distance = p1.distance(p2);


            if ( distance <= 10*boidSize ) {

                //cohesion
                cohesionCount++;
                cohesion += (*iter)->getBoidPosition();

                //alignement
                alignementCount++;
                alignement += (*iter)->getBoidVelocity();

            }

            if ( distance <= 2.5*boidSize ) {

                //separation
                separationCount++;

                Ogre::Vector3 away = boidNode->getPosition() - (*iter)->getBoidPosition();
                away.normalise();
                away*=boidSize;

                away/=(distance/2);
                separation += away;

            }
        }
    }

    /* do some more stuff */

    /* actually move the boid */
    return true;
};

So, as you can see, the basic algorithm is quite heavy, agreeably. I run through a vector of boids, call a method of each boid which then again runs through the vector. The other calculations are basic, just throwing variables around so that everything looks good, nothing that exponentially increases the complexity.

Now, as already stated, I would expect the rendering to go slow at a high amount of boids. I would expect framerates to drop and so on, but that simply doesnt happen. Instead the algorithm runs perfectly fine with high and fluid framerates until I'm roughly at 200 boids +/-, then crashes instantly as soon as swarm::move() is called.

I already checked several loose ends. The vector container has enough room for >1 billion elements, so its not that. I can also initialise everything with 10000, 20000 boids, so it's not a basic memory allocation problem either. It just crashes as soon as swarm::move() is called.

So, why does this crash with a number of 200 and a few boids? Why doesn't the Framerate drop over time instead? Thanks for any help on this. I think I put up all necessary information, however, if you need additional code (or whatever), feel free to ask.

Additional Information via Edit: It doesn't change if I trigger swarm::move manually via click instead of via framerate. It still works with <200ish boids, it still doesnt work with more.

Edit²: Edited the boid::move() method and noted where the debugger catches the crash. However, it doesn't catch the crash at boid 1, which would make sense to me, but in boid 314 (in this case). So, the algorithm runs perfectly fine through the same vector 313 times and then crashes on the 314th time. Does that make any sense?

Edit³: Interesting enough, Debugging stuff confuses me far more than actually pointing towards where the problem lies. I updated boid::move() again and I'll throw in the code for swarm::getFlockMates, I'll explain why in a second.

std::vector<boid*> swarm::getFlockMates() {
    return flock;
}

What confuses me is the following. After I changed the calculation of the distance to what Ben voigt suggested, the code crashed at the finalized movement with values, that shouldn't crash it. Instead, I had a cohesionCount of 1.x million, while alignementCount and separationCount were still 0. This - again - doesn't make sense at all to me. cohesionCount cannot be higher than the total amount of boids, which is 1000 at the moment (and therefor crashing). Even if all Boids would be in range of the cohesion (distance < 10*boidSize), the maximum of it could not be higher than the total amount of boids stored in Flock.

This beeing said, is there a problem how I iterate my flock?

Try running it in a debugger so you can figure out what line is causing it to crash. — Brendan Long, Oct 10 '11 at 20:42
+1 for debugger. I'm guessing there's a null pointer dereference going on here, but it could even have to do with OGRE. Also, have you tried running with a smaller number -- say 150 -- and let it run for a good few hours to see if the crash occurs? There may be some edge case that is rare unless there's a high density of boids. — namuol, Oct 10 '11 at 20:49
Agree on running in a debugger. I've run these kinds of simulations before and could believe still getting decent frame rates up to 200 agents - the calculations really aren't that rough. If I had to guess where the bug is, I'd say it's possible that getFlockMates() could return an uninitialized vector, which could point to junk. — dantswain, Oct 10 '11 at 20:50
Right now your troubleshooting is complicated by the presence of OGRE (display-related) datatypes in the model. I'd start by cleaning view stuff out of the model, converting to OGRE vectors just-in-time for display. And then turn off the view completely and see if the model itself crashes ever. — Ben Voigt, Oct 10 '11 at 21:00
Did run through a debugger, made some edits (noted at the end of my question). Will do the Ogre-related cleanup next, but that's gonna take a while. Also, might try the "let it run for a while" tonight. — Aerius, Oct 10 '11 at 21:11
@Aerius: Do all the OGRE types still work right if the display isn't initialized? You might try running OGRE as math library only temporarily. Also make temporary variables for the intermediate steps of that computation, i.e. `pt1 = boidNode->getPosition(); pt2 = (*iter)->getBoidPosition(); distance = pt1.distance(pt2);` and see which of those fails. — Ben Voigt, Oct 10 '11 at 21:14
Not do easy to simply not initialize Ogre. I'm working a bigger engine that's mostly not written by me, but wraps ogre, luabind, cegui and some other stuff. I cannot just not initialize Ogre, as the whole engine I'm working it basically builds upon it. Also, changing what you mentioned makes me crash at a totally different point, with lots of gibberish values, I'll do another edit. — Aerius, Oct 10 '11 at 21:29
What did `valgrind` say? There is far too much here for us to debug it for you. — Lightness Races in Orbit, Oct 10 '11 at 21:29
Do you resize any vectors inside `boid::move`? Why would it return false? — MSN, Oct 10 '11 at 21:42
Never heared of valgrind, will try it. I know by myself that checking crashing code over distance without actually having all of it and debug information isn't exactely an easy task, I wouldn't ask for help however if I could do this easily on my own. Thanks for your patience and help so far already. — Aerius, Oct 10 '11 at 21:43
@MSN: it shouldn't return false at all. type could also be void for that matter. — Aerius, Oct 10 '11 at 21:44
@TomalakGeret'kal valgrind is used for linux only, it seems to me. I'm running windows and my OGRE is compiled against VS10. — Aerius, Oct 10 '11 at 21:47
This may sound stupid to you, but do you actually compile in debug mode? All these "gibberish" values as you call them sound like release mode / optimized mode values, where you can't trust the debugger, ever. It shows totally strange and unlogical stuff, thanks to the rearrangement of the code done by the optimizer. — Xeo, Oct 10 '11 at 21:52
...is it so obvious I'm not used to run with debuggers much? Ouch. No, I was set at release (for whatever reasons, please, don't ask me, I don't know). Fun enough, it now shows some totally unrelated issues. Works with 500, doesnt work with 300 boids. But at least I have a start. Thank you. — Aerius, Oct 10 '11 at 22:05
Hope you can now get more information out of these "gibberish" values. :) Btw, when replying to someone, please prepend @Name to your message to notify the person of your reply, else it might get overlooked. Also, the debugger is one of the most powerful tools a coder has, learn to use it! :) — Xeo, Oct 10 '11 at 22:14
@Tomalak: AFAIK Windows has no Valgrind equivalent, I could be wrong though. — Xeo, Oct 11 '11 at 13:36
@Xeo: A cursory search for `windows valgrind equivalent` yielded [this very helpful first result](http://stackoverflow.com/questions/413477/is-there-a-good-valgrind-substitute-for-windows). I use [a new site that helps me find stuff on the web](http://www.google.com) - highly recommended! — Lightness Races in Orbit, Oct 11 '11 at 13:47

score 4 · Accepted Answer · answered Oct 13 '11 at 11:04

Turns out, the algorithm I proposed here doesn't have any flaws that did lead to this problem. The memory leak was in one of the Lua-Files I had to use, as this engine (and therefor also this algorithm) is used via that. Thanks for the help nevertheless to all of you!

score 2 · Answer 2 · answered Oct 22 '13 at 00:01

In chapter 6.14 of his book, Daniel Shiffman proposes some algorithms to increase boid performance. It suggests that they shouldnt loop for all other existent boids, but only those close to it, pretty much like using a Quadtree.

http://natureofcode.com/book/chapter-6-autonomous-agents/

score 0 · Answer 3 · answered Oct 10 '11 at 20:42

0

This is more a comment than an answer but I cannot write comments... My guess is, that your problem is not in swarm::move() but before that. A crash would be the most sensible thing if one of your pointers does not point to a boid but to some other memory.

However, I can only guess, because I don't know your code or what kind of crash terminates your program. Btw, you should check with a debugger, most debuggers show you exactly where the crash happens and sometimes even why.

answered Oct 10 '11 at 20:42

znkr

1,746
11
14

That you cannot write comments yet is designed to encourage you to contribute with questions answers to prove your worth first! Not to circumvent the system by posting not-answers as answers :( Welcome though! – Lightness Races in Orbit Oct 10 '11 at 21:30
It's just no fun without comments. :-( Why should I answer questions, that are already answered? There are just too many people on stackoverflow. – znkr Oct 10 '11 at 21:33
There are dozens of new questions posted _every few minutes_. But don't be scared of answering older questions! – Lightness Races in Orbit Oct 10 '11 at 21:37
@TomalakGeret'kal: On many of these "help me find my problem" questions, every answer is just a guess, until it turns out to be the right answer. – Ben Voigt Oct 10 '11 at 21:47
@Ben: That's true. I didn't flag this one actually. I guess it was more of a response out of principle regarding the notion of posting something as not-a-comment purely because of lack of rep to post comments. – Lightness Races in Orbit Oct 10 '11 at 23:28

score 0 · Answer 4 · answered Oct 10 '11 at 21:02

Many algorithms have a radius of convergence. Is your timeSinceLastFrame parameter tied to render-rate? If so, then the population size, by affecting the render rate, is in turn fundamentally affecting the convergence behavior of your algorithm. If you start having stability problems, you could also start triggering floating-point exceptions.

Try making timeSinceLastFrame constant, now you'll run slower than real-time, but convergence will no longer be affected by computation and render times.

already did work with timeSinceLastFrame for some other reasons. Setting it to 1, 2 or 100 doesnt change anything, sadly. — Aerius, Oct 10 '11 at 21:07

Flocking Algorithm crashes on 200+ boids

4 Answers4