
I am writing some code that needs to be as fast as possible without sucking up all of my research time (in other words, no hand-optimized assembly).

My systems primarily consist of a bunch of 3D points (atomic systems), so the code I write does lots of distance comparisons, nearest-neighbor searches, and other kinds of sorting and comparison. These are large, million- or billion-point systems, and naive O(n^2) nested for loops just won't cut it.

It would be easiest for me to just use a std::vector to hold the point coordinates. And at first I thought it would probably be about as fast as an array, so that's great! However, this question (Is std::vector so much slower than plain arrays?) has left me with a very uneasy feeling. I don't have time to write all of my code using both arrays and vectors and benchmark them, so I need to make a good decision right now.

I am sure that someone who knows the detailed implementation behind std::vector could use those functions with very little speed penalty. However, I primarily program in C, and so I have no clue what std::vector is doing behind the scenes, and I have no clue if push_back is going to perform some new memory allocation every time I call it, or what other "traps" I could fall into that make my code very slow.

An array is simple though; I know exactly when memory is being allocated, what the order of all my algorithms will be, etc. There are no black-box unknowns that I may have to suffer through. Yet so often on the internet I see people criticized for using arrays over vectors that I can't help but wonder if I am missing some information.

EDIT: To clarify, someone asked "Why would you be manipulating such large datasets with arrays or vectors?" Well, ultimately, everything is stored in memory, so you need to pick some bottom layer of abstraction. For instance, I use kd-trees to hold the 3D points, but even so, the kd-tree needs to be built on top of an array or vector.

Also, I'm not implying that compilers cannot optimize (I know the best compilers can outperform humans in many cases), but simply that they cannot optimize better than what their constraints allow, and I may be unintentionally introducing constraints simply due to my ignorance of the implementation of vectors.
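
For concreteness, the two alternatives I am weighing look roughly like this (a minimal sketch; the struct and the count are made up):

#include <cstdlib>
#include <vector>

struct Point { double x, y, z; };           // one atom position

int main()
{
    const int n_atoms = 1000000;            // placeholder count

    // Option A: plain array - I know exactly when the one allocation happens.
    Point* arr = (Point*)std::malloc(sizeof(Point) * n_atoms);
    std::free(arr);

    // Option B: std::vector - simpler to use, but what does it do behind the scenes?
    std::vector<Point> vec(n_atoms);
    return 0;
}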

Nick
  • Note the top answer in that question: *So array is twice as quick as vector. **But** after looking at the code in more detail this is expected; as you run across the vector twice and the array only once.* – chris Apr 13 '13 at 20:02
  • I did see that, but look at the second answer too. He demonstrates it's not always intuitive what the appropriate code setup should be. – Nick Apr 13 '13 at 20:05
  • when in doubt, profile – Philipp Apr 13 '13 at 20:05
  • Also check this for hand optimized assembly: [Is inline assembly language slower than native C++ code?](http://stackoverflow.com/questions/9601427/is-inline-assembly-language-slower-than-native-c-code) – Bo Persson Apr 13 '13 at 20:09
  • @didierc, your suggestion would be...? I mean, I'm using kd-trees for NN searching, etc. – Nick Apr 13 '13 at 20:18
  • Also check [this example](http://stackoverflow.com/a/11639305/597607) of what kind of optimizations a C++ compiler can do, when it reduces 10 lines of inlined template code to 4-5 machine instructions. How's that for code bloat? :-) – Bo Persson Apr 13 '13 at 20:33
  • It depends. In some cases I have the dataset from the beginning (e.g. 100 million atoms and their XYZ coordinates). I just generate a kd-tree for that data and use it in the rest of my program. In other cases, I need to construct the dataset myself according to some specification. In this case, I start from zero and add atoms to the system (in a fashion that depends on those atoms already there) until some criterion is met. The dataset may also be frequently modified, and thus my tree would need constant rebalancing. – Nick Apr 13 '13 at 20:59
  • Sorry if my comments seemed aggressive, I didn't mean them to be like that. I deleted them as I feel that they could distract the readers from your main question. – didierc Apr 15 '13 at 00:52
  • No problem, and I appreciate the answer you wrote. – Nick Apr 15 '13 at 01:01
  • std::vector is a pointer to a classic array on the heap, with counters to track the current size (fill) and the total space allocated. When it gets full it allocates a bigger array (1.5x or 2x) and copies the data over. Just after declaring the vector foo, use foo.reserve(1'234'567) and it will start out with a block of space for that many elements, avoiding reallocations. foo.capacity() and foo.size() give the currently reserved space and the currently used space. foo.shrink_to_fit() asks it to reallocate to a smaller memory block that fits the currently used data ("asks" because the standard doesn't require it). – Max Power Jul 08 '22 at 08:53
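
A quick illustration of the reserve / capacity / shrink_to_fit behavior described in that comment (the sizes are arbitrary, and shrink_to_fit is only a request):

#include <iostream>
#include <vector>

int main()
{
    std::vector<double> foo;
    foo.reserve(1000);                  // one block for 1000 elements, up front
    for (int i = 0; i < 10; ++i)
        foo.push_back(i);               // no reallocation: capacity() stays >= 1000
    std::cout << foo.size() << " used, " << foo.capacity() << " reserved\n";
    foo.shrink_to_fit();                // request to give back the unused space
    std::cout << "capacity after shrink_to_fit: " << foo.capacity() << '\n';
    return 0;
}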

7 Answers


It all depends on how you implement your algorithms. std::vector is a general-purpose container: it gives us flexibility, but it also leaves us with the freedom, and the responsibility, to structure the implementation of an algorithm deliberately. Most of the efficiency overhead that we observe with std::vector comes from copying. std::vector provides a constructor which lets you initialize N elements with a value X, and when you use that, the vector is just as fast as an array.

I did a test of std::vector vs. array, described here:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
public:
    TestTimer(const std::string & name) : name(name),
        start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
    {
    }

    ~TestTimer()
    {
        using namespace std;
        using namespace boost;

        posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
        posix_time::time_duration d = now - start;

        cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
            " seconds" << endl;
    }

private:
    std::string name;
    boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }
    

    unsigned char r, g, b;
};

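// resize() default-constructs every Pixel, then the loop assigns each field in place.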
void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

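// reserve() prevents reallocation, but every push_back still copy-constructs from a temporary Pixel.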
void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

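// malloc() returns uninitialized memory; the loop then writes each field exactly once.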
void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}
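
// The fill constructor copy-constructs every element from a single prototype Pixel.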
void UseVectorCtor()
{
    TestTimer t("UseConstructor");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorCtor();
    UseVectorPushBack();

    return 0;
}

and here are the results (compiled on Ubuntu amd64 with g++ -O3):

UseArray completed in 0.325 seconds
UseVector completed in 1.23 seconds
UseConstructor completed in 0.866 seconds
UseVectorPushBack completed in 8.987 seconds
The whole thing completed in 11.411 seconds

Clearly push_back wasn't a good choice here; using the fill constructor is still more than twice as slow as the array. Now, giving Pixel an empty copy constructor:

Pixel(const Pixel&) {}

gives us the following results:

UseArray completed in 0.331 seconds
UseVector completed in 0.306 seconds
UseConstructor completed in 0 seconds
UseVectorPushBack completed in 2.714 seconds
The whole thing completed in 3.352 seconds

So, in summary: re-think your algorithm; otherwise, perhaps resort to a custom wrapper around new[]/delete[]. In any case, the STL implementation isn't slower for some unknown reason, it just does exactly what you ask of it, trusting that you know what you are asking for. When you have only just started with vectors, their behavior can be surprising. For example, this code:

#include <iostream>
#include <vector>

using std::cout;
using std::endl;

class U {
    int i_;
public:
    U() {}
    U(int i) : i_(i) { cout << "consting " << i_ << endl; }
    U(const U& ot) : i_(ot.i_) { cout << "copying " << i_ << endl; }
};

int main()
{
    std::vector<U> arr(2, U(3));
    arr.resize(4);
    return 0;
}

prints the following (the output below is from a pre-C++11 standard library, where resize copies a default-constructed temporary; the garbage numbers are its uninitialized i_):

consting 3
copying 3
copying 3
copying 548789016
copying 548789016
copying 3
copying 3

4pie0
  • Thanks. This gets straight to the critical issue of what I was asking: namely, that if you didn't know in advance to provide an empty copy constructor for Pixel (which, for someone whose background is almost entirely C, is a reasonable thing not to know), you would get bad performance and never know that it could potentially be better. – Nick Apr 13 '13 at 21:18
  • std::vector is very smart; by default, as we see, it might give you a big penalty, but used correctly it can be as fast as a raw array – 4pie0 Apr 13 '13 at 21:20
  • I guess the takeaway from this is that if you are absolutely naive about std::vectors, then you should stick with arrays. If you are willing to invest some of your time into learning about std::vectors, then the benefits that come from using them may outweigh the cost of spending time learning about them. – Nick Apr 13 '13 at 21:24
  • It all depends. The first important question is: do you really need this slightly better performance? Don't get me wrong, I mean sometimes we spend time optimizing things like this 0.1 s difference when another part of the app takes 0.5 s. If this is an important part of your app then yes, you will learn how to use vector and you will get your performance – 4pie0 Apr 13 '13 at 21:32
  • What's with `emplace_back`? (Though I think the reason why `push_back` is slow is not the copying, but the range checks required for growing.) – dyp Apr 13 '13 at 21:49
  • Oh, and there's the dirty trick `reserve` + `operator []`. – dyp Apr 13 '13 at 22:09

Vectors guarantee that the underlying data is a contiguous block in memory. The only sane way to guarantee this is by implementing it as an array.

Memory reallocation on pushing new elements can happen, because the vector can't know in advance how many elements you are going to add to it. But when you know it in advance, you can call reserve with the appropriate number of entries to avoid reallocation when adding them.

Vectors are usually preferred over arrays because they allow bounds checking when accessing elements with .at(). That means accessing indices outside of the vector doesn't cause undefined behavior like it does with an array. This bounds checking does, however, require additional CPU cycles. When you use the []-operator to access elements, no bounds checking is done and access should be as fast as an array. This, however, risks undefined behavior when your code is buggy.
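
For example (a minimal sketch, not taken from the question's code):

#include <iostream>
#include <stdexcept>
#include <vector>

int main()
{
    std::vector<double> v;
    v.reserve(1000);                   // count known up front: no reallocation below
    for (int i = 0; i < 1000; ++i)
        v.push_back(i * 0.5);

    double a = v[10];                  // unchecked: as fast as an array, UB if out of range
    try {
        double b = v.at(5000);         // checked: throws instead of corrupting memory
        (void)b;
    } catch (const std::out_of_range& e) {
        std::cout << "caught: " << e.what() << '\n';
    }
    (void)a;
    return 0;
}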

Philipp
  • If this guy is really manipulating such large datasets, then he should probably either know beforehand what sizes he needs and therefore avoid reallocation, or he could end up writing a good prediction algorithm to minimize these reallocations. In any case, this is better done with a safe and robust facility such as `std::vector` rather than bare arrays. – Alexander Shukaev Apr 13 '13 at 20:20

The people who invented the STL, and then made it part of the C++ standard library, are expletive-deleted smart. Don't even let yourself imagine for one little moment that you can outperform them because of your superior knowledge of legacy C arrays. (You would have a chance if you knew some Fortran, though.)

With std::vector, you can allocate all memory in one go, just like with C arrays. You can also allocate incrementally, again just like with C arrays. You can control when each allocation happens, just like with C arrays. Unlike with C arrays, you can also forget about it all and let the system manage the allocations for you, if that's what you want. This is all absolutely necessary, basic functionality. I'm not sure why anyone would assume it is missing.
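
A rough sketch of those three options (the sizes are placeholders):

#include <vector>

int main()
{
    // 1. All the memory in one go, like a C array allocated with malloc:
    std::vector<double> a(1000000);      // one allocation, elements zero-initialized

    // 2. One allocation up front, elements added incrementally:
    std::vector<double> b;
    b.reserve(1000000);                  // push_back below never reallocates
    for (int i = 0; i < 1000000; ++i)
        b.push_back(i);

    // 3. Forget about it and let the vector manage growth:
    std::vector<double> c;
    for (int i = 0; i < 1000000; ++i)
        c.push_back(i);                  // amortized constant time per element
    return 0;
}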

Having said all that, go with arrays if you find them easier to understand.

n. m. could be an AI
  • Sigh... What's with the assumption on SO that everyone who asks a question has an IQ of 60? I *know* that compiler writers are very smart. I made this incredibly clear in my original post because I even expected a barrage of replies related to that fact. (I even know something about compiler optimization techniques myself.) My point is that with vectors, there are *clearly* things I could do with them that would be suboptimal without knowing their precise implementation. And no matter how ridiculously capable the compiler is, it CANNOT optimize away things that require assumptions. – Nick Apr 13 '13 at 21:06
  • Otherwise you would get all sorts of errors when you did something in your program that the compiler assumed otherwise. The second paragraph of your text is very useful though; I did not realize vectors had so many allocation options. That fact convinces me that vectors probably are the best way to go. – Nick Apr 13 '13 at 21:06
  • (Also, you have even less chance of outperforming Fortran compiler writers because these are often people targeting absolutely performance critical application on supercomputers. That's one of the few areas where Fortran is still used nowadays.) – Nick Apr 13 '13 at 21:11
  • Your point plainly doesn't hold. You don't need to know anything at all about the implementation. You just need to read the documentation. Everything you need to know is there. By way of example, you don't analyze an implementation of `malloc` every time you want to call it, do you? `malloc` is many, many times more complicated than anything in `vector`, and has greater impact on the performance of your program. Yet you trust it to do the right thing for you. – n. m. could be an AI Apr 13 '13 at 21:26
  • I mean, you can outperform `std::vector` with a Fortran array. – n. m. could be an AI Apr 13 '13 at 21:32
  • Well, I know that if I call `malloc` once that it allocates memory and is done. If I limit `malloc` calls to outside of loops, then I know performance isn't hurt by it. The same can't be said for naive `std::vector` usage (except, as you've said, you can control this; which I did not know). – Nick Apr 13 '13 at 21:36
  • OK so basically what you want to know how to allocate all memory for a `vector` in one go. This is a simple, one-line question with a simple, one-line answer (use either `vector::reserve` or `vector::resize`). You can call `push_back` in a loop without calling `reserve` first, just as you can call `realloc` in a loop, adding one element at a time. Both are very similar novice-level mistakes, with similar results. Not anything critical by the way if you do it outside of your main loop. This normally takes only logarithmic time in both cases. – n. m. could be an AI Apr 13 '13 at 21:55
  • You have a lot of knowledge, and some excellent answers to mine and other people's questions. (But you'd get upvoted a lot more without the sly passive-aggressiveness.) – Nick Apr 13 '13 at 21:59

I am not really advising you to go for either arrays or vectors, because I think that for your needs neither may be a perfect fit.

You need to be able to organize your data efficiently, so that queries do not need to scan the whole memory range to get the relevant data. So you want to group the points which are more likely to be selected together close to each other in memory.

If your dataset is static, then you can do that sorting offline and make your array nice and tidy, to be loaded into memory at application start-up time, and either a vector or an array would work (provided you make the reserve call up front for the vector, since the default growth scheme doubles the size of the underlying array whenever it gets full, and you wouldn't want to use up 16 GB of memory for only 9 GB worth of data).

But if your dataset is dynamic, it will be difficult to do efficient inserts into your set with a vector or an array. Recall that each insert within the array shifts all the successor elements by one place. Of course, an index, like the kd-tree you mention, will help by avoiding a full scan of the array, but if the selected points are scattered across the array, the effect on memory and cache will essentially be the same. The shift would also mean that the index needs to be updated.

My solution would be to cut the array into pages (either linked in a list or indexed by an array) and store the data in those pages. That way, it would be possible to group relevant elements together, while still retaining the speed of contiguous memory access within a page. The index would then refer to a page and an offset in that page. Pages wouldn't be filled automatically, which leaves room to insert related elements, or makes shifts really cheap operations.

Note that if pages are always kept full (except for the last one), you still have to shift every single one of them in case of an insert, while if you allow incomplete pages, you can limit a shift to a single page, and if that page is full, insert a new page right after it to contain the supplementary element.
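
A rough sketch of that paging idea (the class, the names, and the fixed page capacity are illustrative only, not production code):

#include <cstddef>
#include <iterator>
#include <list>
#include <vector>

struct Point { double x, y, z; };

// Points live in pages: small contiguous chunks that may stay partially
// filled, so an insert only ever shifts elements inside one page.
class PagedPoints {
public:
    typedef std::vector<Point> Page;
    typedef std::list<Page>::iterator PageIter;

    // Insert p at 'offset' within the page that 'where' points to.
    void insert(PageIter where, std::size_t offset, const Point& p) {
        if (where == pages_.end()) {                  // very first insert
            pages_.push_back(Page());
            where = std::prev(pages_.end());
        }
        if (where->size() >= kPageCapacity) {         // page full: spill its last element
            PageIter next = pages_.insert(std::next(where), Page());
            next->push_back(where->back());           // into a fresh page right after it
            where->pop_back();
            if (offset > where->size()) {             // insertion point spilled over too
                offset -= where->size();
                where = next;
            }
        }
        where->insert(where->begin() + offset, p);    // shift is bounded by the page size
    }

    PageIter begin() { return pages_.begin(); }
    PageIter end()   { return pages_.end(); }

private:
    static const std::size_t kPageCapacity = 1024;    // tune to cache / OS page size
    std::list<Page> pages_;
};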

Some things to keep in mind:

  • array and vector allocations have an upper limit, which is OS-dependent (and the two limits might differ)

On my 32-bit system, the maximum allowed allocation for a vector of 3D points is around 180 million entries, so for larger datasets one would have to find a different solution. Granted, on a 64-bit OS that amount can be significantly larger (on 32-bit Windows, the maximum memory space for a process is 2 GB; I think later versions of the OS added some tricks to extend that amount). Admittedly, memory will be even more problematic for solutions like mine.

  • resizing a vector beyond its capacity requires allocating a block of the new size on the heap and copying the elements from the old memory chunk to the new one.

So for adding just one element to the sequence, you will need twice the memory during the resizing. This issue may not come up with plain arrays, which can be reallocated using the ad hoc OS memory functions (realloc on Unices, for instance, although as far as I know that function doesn't guarantee that the same memory chunk will be reused). The problem can be avoided with a vector as well, if a custom allocator that uses the same functions is used.

  • C++ doesn't make any assumption about the underlying memory architecture.

Vectors and arrays are meant to represent contiguous memory chunks provided by an allocator, and they wrap that memory chunk with an interface to access it. But C++ doesn't know how the OS is managing that memory. In most modern OSes, that memory is actually cut into pages, which are mapped in and out of physical memory. So my solution is, in a way, to reproduce that mechanism at the process level. In order to make the paging efficient, it is necessary to have our page fit the OS page, so a bit of OS-dependent code will be necessary. On the other hand, this is not a concern at all for a vector or array based solution.

So, in essence, my answer is concerned with the efficiency of updating the dataset in a manner that favors clustering related points close to each other. It assumes that such clustering is possible. If that is not the case, then just pushing a new point at the end of the dataset would be perfectly all right.

didierc

Although I do not know the exact implementation of std::vector, most dynamic containers like this are slower than arrays because they allocate new memory when they need to grow, normally doubling the current capacity, although this is not always the case.

So if the vector contains 16 items and you add another, it needs memory for another 16 items. As vectors are contiguous in memory, this means that it will allocate a solid block of memory for 32 items and copy the existing data over. You can get some performance improvement by constructing the std::vector with an initial capacity that is roughly the size you think your data set will be, although this isn't always an easy number to arrive at.
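
For illustration, a small program along these lines (the exact growth factor is implementation-defined, so the printed capacities will vary):

#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> grows;                  // no reserve: capacity grows geometrically
    std::vector<int> reserved;
    reserved.reserve(100);                   // one allocation up front

    std::size_t last = 0;
    for (int i = 0; i < 100; ++i) {
        grows.push_back(i);
        reserved.push_back(i);
        if (grows.capacity() != last) {      // report every reallocation
            last = grows.capacity();
            std::cout << "size " << grows.size()
                      << " -> capacity " << last << '\n';
        }
    }
    std::cout << "reserved vector stayed at capacity "
              << reserved.capacity() << '\n';
    return 0;
}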

Ben Morris
  • This is unavoidable, unless you use a `std::list`-like container without contiguous memory. But this is completely irrelevant to the array vs vector question. – rubenvb Apr 13 '13 at 20:18
  • Not completely irrelevant, the poster asked about the cost of push_back and this somewhat answers that query. As long as the capacity of the vector is sufficient to hold the item then it is not a performance concern. – Ben Morris Apr 13 '13 at 20:23
  • you cannot resize an array, so in principle when comparing the two containers, the single allocation at the start of the algorithm should not make the difference. – rubenvb Apr 15 '13 at 12:26

For operations that are common to vectors and arrays (hence not push_back or pop_back, since arrays are fixed in size), they perform exactly the same, because, by specification, they are the same.

Vector access methods are so trivial that the simplest compiler optimizations will wipe them out. If you know the size of a vector in advance, just construct it by specifying the size, or call resize, and you will get the same thing you can get with a new[].

If you don't know the size, but you know how much you will need to grow, just call reserve, and you get no penalty on push_back, since all the required memory is already allocated.

In any case, reallocations are not so "dumb": the capacity and the size of a vector are two distinct things, and the capacity is typically doubled upon exhaustion, so that large reallocations become less and less frequent.

Also, in case you know everything about the sizes, need no dynamic memory, and want the same kind of interface, consider std::array as well.
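
For instance, a small sketch combining the two (the type name and the sizes are just for illustration):

#include <array>
#include <cstddef>
#include <vector>

typedef std::array<double, 3> Coord;   // exactly three coordinates, no dynamic memory

int main()
{
    std::vector<Coord> atoms;          // dynamic number of atoms...
    atoms.reserve(1000000);            // ...but the allocation is done once, up front
    Coord p = {{1.0, 2.0, 3.0}};
    atoms.push_back(p);
    double x = atoms[0][0];            // fully contiguous storage, array-like access
    (void)x;
    return 0;
}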

Emilio Garavaglia

Sounds like you need gigs of RAM so you're not paging. I tend to go along with @Philipp's answer, because you really really want to make sure it's not re-allocating under the hood

but

what's this about a tree that needs rebalancing? and you're even thinking about compiler optimization?

Please take a crash course in how to optimize software. I'm sure you know all about Big-O, but I bet you're used to ignoring the constant factors, right? They might be out of whack by 2 to 3 orders of magnitude, doing things you never would have thought costly. If that translates to days of compute time, maybe it'll get interesting.

And no compiler optimizer can fix these things for you.

If you're academically inclined, this post goes into more detail.

Mike Dunlavey
  • Honestly, I'm not really sure what you're getting at. I've been programming for 12 years, know all about Big-O (and the fact that it's also used incorrectly wrt its precise definition). Most of my programming has been in C, CoffeeScript, Fortran, Z80 assembly, and a little bit of Clojure, Haskell, and Forth. This is my first foray into high performance C++, so I asked a simple question about what to use as a basis for which all of my other code is written upon. I don't know what it is that you're trying to tell me. – Nick Apr 14 '13 at 00:42
  • In other words, I think you and I have a mismatch between the perceived skills that I have. Algorithmically, I'm good. C++, not at all. – Nick Apr 14 '13 at 00:46
  • @Nick: That's good experience, but [*look here*](http://scicomp.stackexchange.com/a/1870/1262). There's a decently well-written (IMHO) program that, in 6 stages, was sped up by 730 times, without changing the basic algorithm. Some programmers (a small percentage) know how to do that. I'm just trying to increase the percentage. – Mike Dunlavey Apr 14 '13 at 00:46
  • I'm a little confused. You stated in your post not to worry about compiler optimization ("you're even *thinking* about compiler optimization?") and then you linked to a post that is all about compiler optimization without regard to the algorithm. – Nick Apr 14 '13 at 00:52
  • @Nick: It's not about compiler optimization (the stuff you get with the -O3 flag). It's about how you the programmer find stuff that the program is spending a big fraction of time doing, that by tweaking the code you can eliminate, thereby getting a certain speedup ratio. Then you do it again, and something that was less of a drain before is now a higher percentage, so you get an additional factor. Over multiple iterations it really compounds a lot. *Then* you turn on the -O3 and let it do its thing. – Mike Dunlavey Apr 14 '13 at 00:56
  • Alright, that sounds good. Thanks. – Nick Apr 14 '13 at 01:04