
Suppose I have a dynamic array that I want to sort. I could do

std::vector<int> v(100);
for (int i = 0; i < 100; i++) v[i] = rand();
std::sort(v.begin(), v.end());

but for performance-critical code the initialization overhead is unacceptable; more details at https://stackoverflow.com/a/7269088/3667089.

I could also do

int *v = new int[100];
for (int i = 0; i < 100; i++) v[i] = rand();
std::sort(v, v + 100);

but having to manage memory ourselves is bound to cause memory leaks in large codebases.

So it seems that the most feasible approach is

std::unique_ptr<int[]> v(new int[100]);
for (int i = 0; i < 100; i++) v[i] = rand();
std::sort(v, v + 100);

No initialization overhead and no need to worry about memory management, but this produces a long compilation error. Could someone let me know what I am doing wrong?

I am on Ubuntu 14.04, with GCC as the compiler.

EDIT: Changed the code so the data is not already sorted.

user3667089
  • `std::sort(v.get(), v.get() + 100)`? – Kerrek SB Apr 18 '16 at 16:59
  • @KerrekSB Wow that was fast and right. If you write an answer I will accept it. – user3667089 Apr 18 '16 at 17:01
  • (By the way, your data is already sorted...) – Kerrek SB Apr 18 '16 at 17:01
  • How many of these items are you allocating? Having a custom allocator that pulls from a large, pre-allocated slab of memory could solve your problems. As a note, the version of GCC you're using is often more relevant than whatever OS is involved. – tadman Apr 18 '16 at 17:01
  • So the *real* problem you're trying to solve is fast initialization of the vector? – Mark Ransom Apr 18 '16 at 17:12
  • I'm not sure there is an advantage over creating an empty *vector* and calling `reserve(100)`. Also the `rand()` function could be the bottle-neck here. – Galik Apr 18 '16 at 17:29
  • Are you still using a compiler from 2010? If not, your concerns probably aren't justified. A quick test on the linked question shows `useVector` running slightly faster than useArray (with VC++ 2015) or the same speed (with g++ 5.3). – Jerry Coffin Apr 18 '16 at 18:37
  • @JerryCoffin I tried with g++ 5.3 and using raw pointer arrays is still much faster than using std vector – user3667089 Apr 18 '16 at 19:09
  • Then you're almost certainly doing something wrong (probably failing to turn on optimization). – Jerry Coffin Apr 18 '16 at 19:37
  • @JerryCoffin You're missing an important difference between the linked question (value-initialization of `Pixel` does nothing) and OP's question (value-initialization of `int` does zero-initialization). – Barry Apr 18 '16 at 20:04
  • @Barry: If what he really wants is something that acts like an `int`, except that value initialization is a nop, then that's what he should write. – Jerry Coffin Apr 18 '16 at 20:34
  • @JerryCoffin Which is part of my answer. But that doesn't mean that OP is wrong in the assessment that the vector code is slower than the array code. – Barry Apr 18 '16 at 20:45
  • @Barry: I dunno--even with code like his using `int`, I can't find any difference dependable enough to be sure it's real. A quick test with gcc (for example) shows a range from 65 to 98 ms for the array, and 72 to 96 ms for the vector. I'm hard put to justify "fixing" a difference if it's this hard to even be sure it exists. Increasing the size involved doesn't seem to help either. – Jerry Coffin Apr 18 '16 at 21:21

2 Answers


`std::sort` still needs iterators, and a `unique_ptr` is not an iterator. However, it does hold onto something that can be used as one: its pointer:

std::sort(v.get(), v.get() + 100);

or

std::sort(&*v, &*v + 100);

or

std::sort(&v[0], &v[0] + 100); // N.B. &v[100] invokes undefined behavior

But what you really want is a vector allocator that default-initializes instead of value-initializes. That's where the performance difference comes from: using `std::vector` with the default allocator will zero-initialize all your ints first and then assign them some value, whereas your other options do not have this extra zero-initialization step.
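
A minimal sketch of such an allocator is below (an illustrative sketch following the usual construct()-override technique; Casey's linked implementation may differ in its details):

#include <memory>
#include <new>
#include <type_traits>
#include <utility>

// Allocator adaptor that default-initializes instead of value-initializing.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    using a_t = std::allocator_traits<A>;
public:
    template <typename U>
    struct rebind {
        using other = default_init_allocator<U, typename a_t::template rebind_alloc<U>>;
    };

    using A::A;  // inherit the wrapped allocator's constructors

    // Plain "new (ptr) U" performs default-initialization, which for
    // trivially constructible types like int leaves the memory untouched.
    template <typename U>
    void construct(U* ptr)
        noexcept(std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;
    }

    // Any construction with arguments forwards to the wrapped allocator.
    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args) {
        a_t::construct(static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
    }
};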

Check out Casey's implementation of such a thing and then just do:

std::vector<int, default_init_allocator<int>> v(100); // not zero-initialized
for (int i = 0; i < 100; i++) v[i] = i;
std::sort(v.begin(), v.end());

A different approach, which is simpler in the sense that you don't have to deal with allocators (though more intrusive code-wise), is to introduce a wrapper for `int` for which value-initialization does not do anything:

template <class T>
struct Wrapper {
    Wrapper() { }
    T value;
};

std::vector<Wrapper<int>> v(100);              // not zero-initialized
for (int i = 0; i < 100; i++) v[i].value = i;  // but have to do this... 
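
Since `Wrapper<int>` doesn't define `operator<`, sorting it then needs an explicit comparator, for example (a hypothetical usage line, not part of the original snippet):

std::sort(v.begin(), v.end(),
          [](const Wrapper<int>& a, const Wrapper<int>& b) {
              return a.value < b.value;
          });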

Note that simply using `reserve()` and `push_back()` is insufficient: that's still quite a bit more work per element than simply assigning by index after default-initialization, and if you're latency-sensitive enough to have asked this question, that can be significant.
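
To see where that extra work comes from, here is a simplified, hypothetical mini-vector (nothing like a real library implementation in its details) showing the per-element cost push_back implies compared to a plain indexed store:

#include <cstddef>
#include <cstring>

// Toy vector illustrating push_back's per-element overhead.
struct MiniVec {
    int*        data_     = nullptr;
    std::size_t size_     = 0;
    std::size_t capacity_ = 0;

    void push_back(int value) {
        if (size_ == capacity_)    // capacity branch, checked on every call
            grow();                // occasional reallocate-and-copy
        data_[size_++] = value;    // the store, plus size bookkeeping
    }

    void grow() {
        std::size_t cap = capacity_ ? capacity_ * 2 : 8;
        int* p = new int[cap];
        if (data_) std::memcpy(p, data_, size_ * sizeof(int));
        delete[] data_;
        data_ = p;
        capacity_ = cap;
    }

    ~MiniVec() { delete[] data_; }
};

// Once a default-initialized buffer exists, filling element i is just
// data_[i] = value; a single store with no branch and no bookkeeping.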

Barry
  • Option 2 is funny. :) – erip Apr 18 '16 at 17:03
  • Here's a little [demo](https://godbolt.org/g/QHX8F8) of how a non-initializing allocator affects codegen. – Kerrek SB Apr 18 '16 at 17:10
  • Note: both 2 and 3 fail if `*v` has an overloaded `operator &`, so should not be used in generic code. Moral: generic code is impossible to write. – Yakk - Adam Nevraumont Apr 18 '16 at 17:27
  • I just measured the timing; speed-wise, raw pointer > Casey's default-init vector > unique_ptr > std vector. I don't quite understand why, because I would expect raw pointer = Casey's default-init vector = unique_ptr > std vector ... – user3667089 Apr 18 '16 at 17:30
  • Option 3 is UB; it breaks a *Requires* on `operator[]`. – T.C. Apr 18 '16 at 20:23
  • @T.C. Hm. Why doesn't `vector::operator[]` have a similar clause? They do the same thing for all practical purposes (for not `vector`). – Barry Apr 18 '16 at 20:46
  • Magnificent detailed answer. There is a typo: "assigning" should replace "assing". Also, despite applying your solutions to avoid the initialization overhead, raw pointer still performs a notch faster, and I found a possible explanation at http://stackoverflow.com/a/34596711/3667089. Seems like if performance is absolutely critical, there is no way to avoid dealing with the raw pointer mess... – user3667089 Apr 18 '16 at 20:58
  • @T.C. Ah, yep. So also UB there. Though in practice I wonder how much code there is out there that looks like that... – Barry Apr 18 '16 at 21:53

Reading the link from the question, it seems you'd be happy using a vector if it didn't call an unnecessary constructor for every element. There are techniques to eliminate this overhead.

std::vector<int> v;
v.reserve(100);
for (int i = 0; i < 100; i++) v.emplace_back(rand());
std::sort(v.begin(), v.end());
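
If you want to measure this yourself, here is a minimal self-contained harness (a hypothetical sketch, not part of the original answer; compile with optimization, e.g. -O2, and keep in mind that rand() itself may dominate the timing, as noted in the comments above):

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int N = 100000;

    // Approach 1: reserve, then emplace_back each element.
    auto t0 = std::chrono::steady_clock::now();
    std::vector<int> a;
    a.reserve(N);
    for (int i = 0; i < N; i++) a.emplace_back(std::rand());
    auto t1 = std::chrono::steady_clock::now();

    // Approach 2: value-initialize (zeroing) up front, then assign by index.
    std::vector<int> b(N);
    for (int i = 0; i < N; i++) b[i] = std::rand();
    auto t2 = std::chrono::steady_clock::now();

    using std::chrono::duration_cast;
    using std::chrono::microseconds;
    std::printf("reserve + emplace_back: %lld us\n",
                (long long)duration_cast<microseconds>(t1 - t0).count());
    std::printf("value-init + assign:    %lld us\n",
                (long long)duration_cast<microseconds>(t2 - t1).count());
    return 0;
}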
Mark Ransom
  • There are some checks that will slow push_back down, see http://stackoverflow.com/a/20168172/3667089 – user3667089 Apr 18 '16 at 17:23
  • @user3667089, call me when you can measure the performance impact of one branch (which would be eliminated by branch prediction on the 4th iteration) and a non-atomic increment. – SergeyA Apr 18 '16 at 17:24
  • @mark Sadly, there is no `emplace_back_I_know_there_is_capacity` method in `std::vector`, and I have yet to see a compiler figure that out in any profiling test (or in any ASM output). Maybe you can prove me wrong, I haven't looked for at least a year. – Yakk - Adam Nevraumont Apr 18 '16 at 17:28
  • @SergeyA You can easily measure the performance impact. If you wrote a benchmark, you'd see that there's quite a difference. – Barry Apr 18 '16 at 17:43
  • @Barry that might be true if the initialization is trivial, such as a constant or an incrementing value. The difference quickly fades to insignificance if there is any work being done to generate the value, such as the `rand` in the question. Yes it might still be measurable, and C++ still lets you write to the bare metal if you require it. This answer simply attempts to push the boundary of what you can achieve without heroic effort. – Mark Ransom Apr 18 '16 at 18:06
  • @SergeyA [Here](http://stackoverflow.com/a/24806233/1774667) is a SO post with a link to [live test code](https://ideone.com/WETAzs) where you can tweak it to see the slowdown caused by `push_back` compared to `[]` style access of a trivially constructible object. And yes, `rand()` can easily be more expensive than the overhead from `push_back`/`emplace_back`. I haven't rerun it recently. – Yakk - Adam Nevraumont Apr 18 '16 at 18:17
  • @Yakk, I did. A benchmark test of 100000 iterations dealing with an (array|vector) of 100000 elements gives me 1m24.95s of user time for the vector and 1m19.37s of user time for the array. I refuse to see this as any significant difference warranting the manual array as opposed to the vector. Code available on request. – SergeyA Apr 18 '16 at 18:33
  • @MarkRansom Huh? The *difference* is the same either way. Doesn't matter where the data comes from, that part's time is constant regardless of which container we use. – Barry Apr 18 '16 at 18:37
  • @Yakk, nope, randing. As in the original question. – SergeyA Apr 18 '16 at 18:51
  • @sergeyA So 10 million ops (100k iterations of 100k elements) at 2 GHz is ~200 cycles/item. The push_back overhead was about ~12 cycles. Seems reasonable. If you are working with the equivalent of a 4k image (8 megapixels) at 60 Hz (480 megapixels/second), a push_back would cost 5.8 billion cycles/second. Basically, this level of optimization only matters when you approach that point; but getting *near* that point is not hard without playing with ridiculous workloads. (To those watching at home: don't try to process 4k video on a single CPU thread). – Yakk - Adam Nevraumont Apr 18 '16 at 18:52
  • @Yakk, it's actually 3GHz. `model name : Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz` (should have said so earlier). And I guess I got an insight into your field of work? :) And the way I see it is slightly different. If whatever you are doing takes 80 seconds, it hardly matters if it takes 83 seconds. Going back to your example, I doubt it matters for viewers if a single frame is rendered in 80 seconds as opposed to 83 seconds - still unacceptable. You really need to be at the boundary of acceptable for this to be of any significance. – SergeyA Apr 18 '16 at 18:54