
I have the following function:

double 
neville (double xx, size_t n, const double *x, const double *y, double *work);

which performs Lagrange interpolation at xx using the n points stored in x and y. The work array has size 2 * n. Since this is polynomial interpolation, n is in the ballpark of ~5, very rarely more than 10.
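
For reference, here is a minimal sketch of what such a routine might look like: Neville's algorithm with the two correction arrays c and d packed into work, which is where the 2 * n size requirement comes from. The actual implementation is not shown here; this version is purely illustrative.

#include <cstddef> // size_t

double
neville (double xx, size_t n, const double *x, const double *y, double *work)
{
    double *c = work;       // c[i]: correction when stepping "down" the tableau
    double *d = work + n;   // d[i]: correction when stepping "up"

    for (size_t i = 0; i < n; ++i)
        c[i] = d[i] = y[i];

    double result = y[0];
    for (size_t m = 1; m < n; ++m)      // m-th column of the Neville tableau
    {
        for (size_t i = 0; i + m < n; ++i)
        {
            double ho  = x[i] - xx;
            double hp  = x[i + m] - xx;
            double den = (c[i + 1] - d[i]) / (ho - hp);
            c[i] = ho * den;
            d[i] = hp * den;
        }
        result += c[0];     // always extend along the top row of the tableau
    }
    return result;
}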

This function is aggressively optimized and is meant to be called in tight loops. Profiling shows that heap-allocating the work array inside the loop is costly. Unfortunately, I am supposed to package this into a function-like class, and clients must be unaware of the work array.

For now, I use a template integer argument for the degree and std::array to avoid dynamic allocation of the work array:

#include <array>   // std::array
#include <cstddef> // size_t

template <size_t n>
struct interpolator
{
    double operator() (double xx) const
    {
        std::array<double, 2 * n> work;
        size_t i = locate (xx); // not shown here, no performance impact
                                // due to clever tricks + nice calling patterns

        return neville (xx, n, x + i, y + i, work.data ());
    }        

    const double *x, *y;
};

It would have been possible to store the work array as a mutable member of the class, but operator() is supposed to be used concurrently by several threads. This version is OK provided you know n at compile time.
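
For illustration, a hypothetical call site for this compile-time version (xs and ys are made-up names for arrays holding the sample abscissas and ordinates; they do not appear above):

const double xs[] = { 0.0, 1.0, 2.0, 3.0, 4.0 };
const double ys[] = { 1.0, 2.7, 7.4, 20.1, 54.6 };

interpolator<5> interp{xs, ys}; // degree fixed at compile time
double v = interp (0.3);        // work lives on the stack, no heap traffic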

Now, I need the n parameter to be specified at run time. I am wondering about something like this:

double operator() (double xx) const
{
    auto work = static_cast<double*> (alloca (2 * n * sizeof (double)));
    ...

Some alarm bells ring when using alloca: I'm of course going to put a cap on n so that the alloca call cannot overflow the stack (in any case it would be quite stupid to use degree-100 polynomial interpolation).
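
Concretely, the capped version might look like this sketch (N_MAX is a hypothetical constant, and 2 * n doubles are allocated to match the documented work size):

#include <alloca.h>  // or <malloc.h> and _alloca with MSVC
#include <cassert>

double operator() (double xx) const
{
    assert (n <= N_MAX); // N_MAX: hypothetical cap; huge degrees are useless anyway
    auto work = static_cast<double *> (alloca (2 * n * sizeof (double)));
    size_t i = locate (xx);
    return neville (xx, n, x + i, y + i, work);
}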

I'm quite uncomfortable with the approach, however:

  • Am I missing some obvious danger of alloca?
  • Is there a better way to avoid heap allocation here?
Alexandre C.
  • Can't you just write this function in C and use C99 VLAs? – Apr 30 '13 at 18:45
  • @KonradRudolph `double neville (double xx, size_t n, const double *x, const double *y, double *work);` - you need operator overloading to write this function? Wow, I never knew! – Apr 30 '13 at 18:59
  • @H2CO3 Hehe, caught me. Well, my last argument is that I intensely dislike linking C and C++ code. Of course there is no real problem (if done properly! And I’ve encountered many C libraries which did it wrong and caused me a lot of pain). But then, I see zero benefit to using VLAs over `alloca` but maybe I’m missing something …? – Konrad Rudolph Apr 30 '13 at 19:02
  • @KonradRudolph Agreed on linkage, but C has some nice features (like VLAs) C++ hasn't. What I hate is intermixing C and C++ code at file scope. If separated properly, the two languages can co-operate nicely. –  Apr 30 '13 at 19:04
  • @KonradRudolph To the benefit: `alloca()` can invoke UB when it fails, VLAs are in the C99 standard, `alloca()` is not even POSIX, etc. – Apr 30 '13 at 19:09
  • @H2CO3 Read the comments on http://stackoverflow.com/a/1018865/1968. In essence, VLAs have the exact same disadvantages as `alloca`, minus lack of standardisation. But GCC *does* support it and if you want to write portable code you can supply `alloca.h` yourself (although lack of standardisation is a good point, and worth amending my answer). – Konrad Rudolph Apr 30 '13 at 19:13
  • @H2CO3: Not only do VLAs have the same disadvantages, implementations are not even required to support them (see `__STDC_NO_VLA__`). So your entire bit about it being standard doesn't hold much either. Who cares if it's not POSIX? I've never looked at a potential solution and said "if only it were POSIX...". If it works in my domain and it's the best I have, it's what I'll do. Standardization is nice for uniform and pretty syntax, but at the end of the day every compiler has some form of `alloca` one way or another. Make it as pretty as you want (you can even waste time mixing C and C++!). – GManNickG Apr 30 '13 at 19:21
  • @GManNickG Please don't turn aggressive. I see the disadvantages, but there's no need to tell me off. – Apr 30 '13 at 19:22
  • If dynamic stack allocation feels distasteful, you could consider a large enough fixed-size buffer in thread-local storage (see the sketch below). If I read your question correctly, it would work in your use case. – hyde Apr 30 '13 at 19:50
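
A minimal sketch of hyde's thread-local suggestion, assuming a C++11 compiler with thread_local support (N_MAX is again a hypothetical cap on n):

double operator() (double xx) const
{
    // One buffer per thread: concurrent calls never share it, and it is
    // allocated once per thread instead of once per call.
    static thread_local double work[2 * N_MAX];
    size_t i = locate (xx);
    return neville (xx, n, x + i, y + i, work);
}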

3 Answers


I'm quite uncomfortable with the approach, however:

  • Am I missing some obvious danger of alloca?

You pointed out the one real danger: stack overflow behaviour is undefined for alloca. In addition, alloca isn’t actually standardised. For instance, Visual C++ has _alloca instead, and GCC by default defines it as a macro. That problem can be circumvented fairly easily, however, by providing a thin wrapper around the few existing implementations.
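
Such a wrapper might look like the following sketch (the name WORK_ALLOCA is made up, and only MSVC and GCC/Clang are covered; other toolchains would need their own branch):

#if defined(_MSC_VER)
  #include <malloc.h>   // MSVC declares _alloca here
  #define WORK_ALLOCA(size) _alloca(size)
#else
  #include <alloca.h>   // glibc and friends declare alloca here
  #define WORK_ALLOCA(size) alloca(size)
#endif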

  • Is there a better way to avoid heap allocation here?

Not really. C++14 will have a (potentially!) stack-allocated variable-length array type. But until then, and if you consider std::array not to be a good fit, go for alloca in cases such as yours.

Minor nitpick though: the return value of alloca needs a cast in C++ (as your snippet does with static_cast); without the cast the code shouldn’t even compile.

– Konrad Rudolph
  • I’m hurt by the anonymous downvotes. Anybody care to soothe my wounds? – Konrad Rudolph Apr 30 '13 at 19:16
  • Just a guess about downvotes: suggesting use of non-standard feature especially without saying that (in your original version), and maybe suggesting use of what is basically a C library function in C++ have potential to attract purist downvotes, and you did both. – hyde Apr 30 '13 at 19:27

There are always a number of caveats to any use of stack memory. As you point out, stacks have finite size and fairly serious misbehavior when that space runs out. A stack overflow will hopefully crash if there are guard pages, but on some platforms and threading environments it can instead mean silent corruption (bad) or a security issue (worse).

Also remember that stack allocation is very fast compared to malloc (it's just a subtraction from the stack pointer register). But the use of that memory may not be. The side effect of pushing your stack frame down by large amounts is that the cache lines of the leaf functions you are about to call are no longer resident. So any use of that memory needs to go out to the SMP environment to bring those cache lines back into an exclusive (in the MESI sense) state. The SMP bus is a much (!) more constrained environment than L1 cache, and if you are spamming your stack frames around this can be a real scalability issue.

Also, as far as syntax goes, note that both gcc and clang (and Intel's compiler, I believe) support the C99 variable-length array syntax as a C++ extension. You may not need to call the libc alloca() routine at all.
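
Under that extension, the work buffer can be declared directly with a runtime length, as in this sketch (GCC's -Wvla will warn about it, and MSVC rejects it outright):

double operator() (double xx) const
{
    double work[2 * n]; // C99 VLA, accepted by g++ and clang++ as an extension
    size_t i = locate (xx);
    return neville (xx, n, x + i, y + i, work);
}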

Finally, note that malloc really isn't that slow. If you're dealing with single buffers in the realm of dozens of kilobytes or bigger, the memory bandwidth required to service whatever work you are going to do on them is going to swamp any overhead from malloc.

Basically: alloca() is cute, and has its uses, but unless you have a benchmark ready to prove that you need it, you probably don't and should just stick with traditional allocation.

– Andy Ross
  • Are you making particular assumptions about cache associativity? Because I don't see why dynamic memory should bring any fewer pages into cache -- in fact it should touch a lot more, because it has to access the heap internal data structures. So it's more likely, not less, to cause eviction of pages used by leaf functions. If you're concerned about those pages not having been in cache to begin with, I don't see why. In programs that make heavy use of large stack allocations, those stack pages are going to be warm in cache. – Ben Voigt Apr 30 '13 at 19:26
  • Function `neville` is small, and does not call anybody. mallocing the work array each time triples runtime of my actual `operator()` (with `std::array`), says profiler. Also `work` has size few dozen bytes maximum. Thanks for the insight however. – Alexandre C. Apr 30 '13 at 19:33
  • Ben: it's not cache footprint, it's the cache line state. Storing to or loading from an L1 cache line requires no traffic outside the local CPU as long as the line is in an E state. So calling a leaf function on top of those lines can be fast, where calling the same function after dropping 14k on the stack will not. Instead, the CPU has to first broadcast the operation to all the other CPUs to allow their snoop logic to see it. For rapidly-called leaf functions, this can be non-trivial. – Andy Ross Apr 30 '13 at 20:02
  • @andy: I see what you're getting at concerning the very first call. (Of course, that involves guard pages, exception handling, updating TLBs -- and bringing the new pages into the exclusive state is the least of your worries) But what would cause those pages to leave the E state after that? The stack is local to each thread, no other thread will claim ownership. The only thing causing loss of exclusivity on the cache line is if it gets evicted. And a local stack buffer will cause less eviction. Sorry, although I like the direction you're thinking, this answer is wrong. – Ben Voigt Apr 30 '13 at 20:04
  • The SMP logic doesn't know the stack is "local". All it knows is that the CPU is using different lines than it was. And again your use of "pages" leads me to believe you're misunderstanding the issue -- we're talking about cache hardware, not MMU behavior. – Andy Ross Apr 30 '13 at 20:06
  • Stack growth is very expensive -- once. Committing new pages requires synchronized access with system-wide VMM data structures. Possibly flushing pages to disk, zeroing them, etc. – Ben Voigt Apr 30 '13 at 20:06
  • Ben: http://en.wikipedia.org/wiki/MESI_protocol . I don't know that further discussion here will be productive. – Andy Ross Apr 30 '13 at 20:07
  • But it isn't using different lines than it was. Either (a) this thread is growing its stack, or (b) this thread was the last core to use those cache lines in any way, meaning they're still in an E state unless they got evicted. – Ben Voigt Apr 30 '13 at 20:08
  • I know what MESI, MOSI, and MOESI cache coherency protocols are. You haven't explained what you think brings those cache lines out of the exclusive state. – Ben Voigt Apr 30 '13 at 20:08
  • The stack is bouncing around with variable allocations. What could possibly lead you to believe that they **would** be in an unshared state? – Andy Ross Apr 30 '13 at 20:10
  • I know only three reasons why the cache lines would lose exclusivity: (1) actual access from another thread -- won't happen for locally used work buffers (2) eviction because the cache is too small -- and dynamic allocation is more likely to cause this (3) the scheduler moving the thread to another core -- which is going to have the same performance-killing effects on dynamically-allocated buffers used by just one thread. – Ben Voigt Apr 30 '13 at 20:11
  • @Andy: Everything, really. All those allocations are done within a single call tree, from a single thread, remaining (hopefully) on a single processor core. There's no sharing going on -- why would they be in a shared state? – Ben Voigt Apr 30 '13 at 20:13

How about this:

double operator() (double xx) const
{
    double work_static[2 * STATIC_N_MAX];  // neville needs 2 * n doubles
    double* work = work_static;
    std::vector<double> work_dynamic;

    if ( n > STATIC_N_MAX ) {
        work_dynamic.resize(2 * n);
        work = work_dynamic.data();
    }

    ///...

No non-portable features, exception-safe, and degrades gracefully when n is too large. Of course, you may make work_static a std::array, but I'm not sure what benefit you see in that.

– Ben Voigt
  • Sometimes I miss the obvious... I'm considering disallowing values of `n` greater than say 20 (or some preprocessor constant, my real functions are templated on the `y` parameter so full source code is available). This is worth a shot, if allocating `320` bytes on the stack each time does not degrade performance. – Alexandre C. Apr 30 '13 at 19:55
  • @AlexandreC.: One could imagine a `std::vector` implementation with a "small string optimization" which wraps this ugliness; a sketch of the idea follows at the end of this page. – Ben Voigt Apr 30 '13 at 19:56
  • @AlexandreC. If you think heap alloc may be a bottleneck, then CPU cache impact of an unnecessarily large array in stack might not be desirable either. But, benchmark under real multithreaded load. – hyde Apr 30 '13 at 20:00
  • @hyde: Sure. I don't *think*, I have *measured* heap allocation to be a bottleneck. Here, also, benchmarking is very important. – Alexandre C. Apr 30 '13 at 20:01
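
As a sketch of the small-buffer idea from Ben Voigt's comment above (the class name small_buffer is made up; it simply falls back to the heap when the requested size exceeds the inline capacity):

#include <cstddef>

template <std::size_t StaticN>
struct small_buffer
{
    explicit small_buffer (std::size_t n)
        : p (n <= StaticN ? stack : new double[n]) {}
    ~small_buffer () { if (p != stack) delete[] p; }

    double *data () { return p; }

private:
    double stack[StaticN];  // used when the request fits
    double *p;              // points into stack, or at a heap block

    small_buffer (const small_buffer &);            // non-copyable (C++03 style)
    small_buffer &operator= (const small_buffer &);
};

// Hypothetical use inside operator(), with n capped at 20 as discussed:
//     small_buffer<2 * 20> buf (2 * n);
//     return neville (xx, n, x + i, y + i, buf.data ());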