
I am porting some C99 code that makes heavy use of variable length arrays (VLA) to C++.

I replaced the VLAs (stack allocation) with an array class that allocates memory on the heap. The performance hit was huge: a slowdown by a factor of 3.2 (see benchmarks below). What fast VLA replacement can I use in C++? My goal is to minimize the performance hit when rewriting the code for C++.

One idea that was suggested to me was to write an array class that contains fixed-size storage within the class (i.e. can be stack-allocated) and uses it for small arrays, automatically switching to heap allocation for larger arrays. My implementation of this is at the end of the post. It works fairly well, but I still cannot reach the performance of the original C99 code. To come close to it, I must increase this fixed-size storage (MSL below) to sizes I am not comfortable with. I don't want to allocate huge arrays on the stack even for the many small arrays that don't need them, because I worry that it will trigger a stack overflow. A C99 VLA is actually less prone to this because it never uses more storage than needed.

I came upon std::dynarray, but my understanding is that it was not accepted into the standard (yet?).

I know that clang and gcc support VLAs in C++, but I need it to work with MSVC too. In fact, better portability is one of the main goals of rewriting in C++ (the other goal being turning the program, which was originally a command line tool, into a reusable library).


Benchmark

MSL refers to the array size above which I switch to heap-allocation. I use different values for 1D and 2D arrays.

Original C99 code: 115 seconds.
MSL = 0 (i.e. heap allocation): 367 seconds (3.2x).
1D-MSL = 50, 2D-MSL = 1000: 187 seconds (1.63x).
1D-MSL = 200, 2D-MSL = 4000: 143 seconds (1.24x).
1D-MSL = 1000, 2D-MSL = 20000: 131 seconds (1.14x).

Increasing MSL improves performance further, but eventually the program starts returning wrong results (I assume due to stack overflow).

These benchmarks are with clang 3.7 on OS X, but gcc 5 shows very similar results.


Code

This is the current "smallvector" implementation I use. I need 1D and 2D vectors. I switch to heap-allocation above size MSL.

#include <cstddef>  // size_t

template<typename T, size_t MSL=50>
class lad_vector {
    const size_t len;
    T sdata[MSL];   // in-object storage, used when len <= MSL
    T *data;
public:
    explicit lad_vector(size_t len_) : len(len_) {
        if (len <= MSL)
            data = &sdata[0];
        else
            data = new T[len];
    }

    // non-copyable: copying would alias or double-delete `data`
    lad_vector(const lad_vector &) = delete;
    lad_vector &operator=(const lad_vector &) = delete;

    ~lad_vector() {
        if (len > MSL)
            delete [] data;
    }

    const T &operator [] (size_t i) const { return data[i]; }
    T &operator [] (size_t i) { return data[i]; }

    operator T * () { return data; }
};


template<typename T, size_t MSL=1000>
class lad_matrix {
    const size_t rows, cols;
    T sdata[MSL];   // in-object storage, used when rows*cols <= MSL
    T *data;

public:
    explicit lad_matrix(size_t rows_, size_t cols_) : rows(rows_), cols(cols_) {
        if (rows*cols <= MSL)
            data = &sdata[0];
        else
            data = new T[rows*cols];
    }

    // non-copyable, for the same reason as lad_vector
    lad_matrix(const lad_matrix &) = delete;
    lad_matrix &operator=(const lad_matrix &) = delete;

    ~lad_matrix() {
        if (rows*cols > MSL)
            delete [] data;
    }

    T const * operator[] (size_t i) const { return &data[cols*i]; }
    T * operator[] (size_t i) { return &data[cols*i]; }
};
Szabolcs
  • There is no substitute for VLAs when it comes to overhead. The storage for a VLA is completely cost-free: in most cases it costs nothing above the existing overhead of a function call. Can't really do better than 0% cost, so if MSVC does not have VLAs, you have no choice but to use some other alternative and take a performance hit. – Sam Varshavchik Apr 03 '16 at 20:04
  • @SamVarshavchik My question is: what is a good alternative that minimizes the performance hit? There's one I show in the question. I don't know if there is anything better. I also don't know if there is a way to allocate memory on the stack in C++ that would allow me to emulate how VLAs work (possibly one or more platform-specific ways ... I need this for Windows/Linux/OS X). – Szabolcs Apr 03 '16 at 20:06
  • If you are happy to go "platform specific" then `GCC` does VLAs as an extension and runs on all those platforms. – Galik Apr 03 '16 at 20:14
  • @Galik I can go platform-specific to a reasonable extent (i.e. using OS-specific calls in my array class implementation). But I still need the code to work with MSVC. And I don't want to specialize anything other than this class for different platforms. Is it possible to use a VLA *within a class* or to wrap it with a class? My understanding so far was that it isn't. But if it is, I would do that (for when it's compiled with gcc) and use the above solution with MSVC. – Szabolcs Apr 03 '16 at 20:18
  • No it's not. Your only option is to allocate a vector/array on the heap. That's it. – Sam Varshavchik Apr 03 '16 at 20:29
  • There is also `alloca` (platform-specific function, but it exists on Linux/Windows/OS X): http://man7.org/linux/man-pages/man3/alloca.3.html It dynamically allocates memory on the stack. – tmlen Apr 03 '16 at 20:37
  • @tmlen Is it possible to use `alloca` to create an on-stack array class? The difficulty I'm encountering is that if I call `alloca` in the constructor of the class, it will use the constructor's stack (and the pointer won't be valid once the constructor has returned). – Szabolcs Apr 03 '16 at 21:03
  • You are barking up a very wrong tree; VLA memory is not faster than `new[]` memory. You have to document the `T` you used. Crystal ball says that it is `float` or `double`. What you lost is auto-vectorization, which makes code 3 to 4 times faster. You need at least VS2012 to get that back. Use the built-in auto-vectorization and auto-parallelization diagnostic options to find out what the optimizer did. – Hans Passant Apr 03 '16 at 21:11
  • @HansPassant `T` is always `int`, `char` or `bool` in this application, never floating point. It is a complex combinatorial algorithm. Also, if the root of the performance problem is that auto-vectorization is lost, then why does the optimization I tried work as well as it does? – Szabolcs Apr 03 '16 at 21:14
  • @Hans I think my original post was misleading. The benchmarks I show are with clang, not MSVC. The code still must run with MSVC, but I'm working on OS X mainly, and testing on Windows when needed. – Szabolcs Apr 03 '16 at 21:19
  • Can you show the benchmark code? Perhaps there might be a way to use more memory for improved speed? – rubenvb Apr 03 '16 at 21:27
  • `alloca` would need to be called in the function whose stack should be used. That is, not in the constructor of the vector class (or the member initializer list). The class could take the pointer as a constructor argument, like `lad_vector vec( (int*)alloca(10 * sizeof(int)), 10 );`. Maybe make a macro for this (but not an inline function), to get a syntax like `lad_vector vec = MAKE_LADVECTOR(10);` – tmlen Apr 03 '16 at 21:33
  • I think your C99 code is broken; you cannot portably rely on VLAs that large. I'm surprised that you never get a stack overflow. It also looks strange that you get better performance for very high MSL values: for that much data, I would expect the cost of heap allocation to be very small compared to the cost of processing the data. – Benoît Apr 03 '16 at 21:38
  • @Benoît I think the implementation allocates a lot inside calculation heavy code, or so it seems. Yet that doesn't explain away the improvement, on the contrary. – rubenvb Apr 03 '16 at 21:47
  • *Increasing MSL further improves performance more, but eventually the program will start returning wrong results (I assume due to stack overflow).* I don't see how a stack overflow can give you wrong results. On any sane system, at worst you should get a segfault. (Barring something very unusual, like overflowing by so much that you wind up in some other area of valid memory.) So maybe you should look for a bug. – Nate Eldredge Apr 04 '16 at 04:10
  • Might this be relevant or helpful: [Looking for C++ STL-like vector class but using stack storage](http://stackoverflow.com/a/354481/1734730) – Nathan Cooper Apr 04 '16 at 15:29
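For what it's worth, the constructor-argument idea from tmlen's comment could be sketched as a non-owning view (names hypothetical; the caller, not the class, owns the storage):

```cpp
#include <cstddef>

// A non-owning array view over caller-provided storage.
template <typename T>
class lad_view {
    T *data_;
    std::size_t len_;
public:
    lad_view(T *storage, std::size_t len) : data_(storage), len_(len) {}
    T &operator[](std::size_t i) { return data_[i]; }
    const T &operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return len_; }
};

// alloca must run in the frame that owns the memory, hence a macro, not a function:
// #define MAKE_LADVECTOR(T, n) lad_view<T>((T *)alloca((n) * sizeof(T)), (n))
```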

3 Answers


Create a large buffer (MB+) in thread-local storage. (Actual memory on heap, management in TLS).

Allow clients to request memory from it in FILO manner (stack-like). This mimics how VLAs work in C, and it is efficient: each request/return is just an integer addition/subtraction.

Get your VLA storage from it.

Wrap it pretty, so you can say `stack_array<T> x(1024);`, and have that `stack_array` deal with construction/destruction (note that `->~T()` where `T` is `int` is a legal noop, and construction can similarly be a noop), or make `stack_array<T>` wrap a `std::vector<T, TLS_stack_allocator>`.
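A minimal sketch of this TLS scheme, assuming a fixed 1 MB per-thread buffer, no heap fallback, and trivially destructible element types (`tls_arena`, `arena` and `stack_array` are made-up names):

```cpp
#include <cassert>
#include <cstddef>

// One large per-thread buffer; each request/return just moves an offset (FILO).
struct tls_arena {
    static constexpr std::size_t size = 1 << 20;  // 1 MB per thread
    alignas(16) char buf[size];
    std::size_t top = 0;
};

inline tls_arena &arena() {
    static thread_local tls_arena a;  // per-thread storage, off the program stack
    return a;
}

// VLA-like array drawing its storage from the arena above.
template <typename T>
class stack_array {
    T *p_;
    std::size_t n_, bytes_;
public:
    explicit stack_array(std::size_t n) : n_(n) {
        bytes_ = (n * sizeof(T) + 15) & ~std::size_t(15);  // keep the arena aligned
        assert(arena().top + bytes_ <= tls_arena::size);   // no heap fallback in this sketch
        p_ = reinterpret_cast<T *>(arena().buf + arena().top);
        arena().top += bytes_;
    }
    ~stack_array() { arena().top -= bytes_; }  // FILO: destroyed in reverse creation order
    stack_array(const stack_array &) = delete;
    stack_array &operator=(const stack_array &) = delete;
    T &operator[](std::size_t i) { return p_[i]; }
    const T &operator[](std::size_t i) const { return p_[i]; }
    std::size_t size() const { return n_; }
};
```

Because C99 code destroys its VLAs in reverse creation order, the single offset decrement in the destructor is enough to reclaim storage.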

Data will not be as local as C VLA data, because it effectively lives on a separate stack. You can add SBO (small buffer optimization) for the cases where locality really matters.

An SBO `stack_array<T>` can be implemented with an allocator and a `std::vector` unioned with a `std::array`, or with a `unique_ptr` and a custom deleter, or in a myriad of other ways. You can probably retrofit your solution, replacing your new/malloc/free/delete with calls to the above TLS storage.

I say go with TLS as that removes the need for synchronization overhead while allowing multi-threaded use, and it mirrors the fact that the stack itself is implicitly TLS.

Stack-buffer based STL allocator? is an SO Q&A with at least two "stack" allocators in the answers. They will need some adaptation to automatically get their buffer from TLS.

Note that the TLS being one large buffer is in a sense an implementation detail. You could do large allocations, and when you run out of space do another large allocation. You just need to keep track of each "stack page's" current capacity and keep a list of stack pages, so when you empty one you can move back to an earlier one. That lets you be more conservative in your initial TLS allocation without worrying about running OOM; the important part is that allocation is FILO and rare, not that the entire FILO buffer is one contiguous block.

Yakk - Adam Nevraumont
  • Interesting idea, I'll try it. What is SBO? – Szabolcs Apr 03 '16 at 21:02
  • I would like to know why this was downvoted. The use case is replacing C99 VLAs in code originally written in C99. This means that arrays are always destroyed in the reverse order of their creation, so the idea to take their storage from a "manually managed stack" should work ... If there's an expected problem, I would like to know. – Szabolcs Apr 03 '16 at 21:09
  • @sza small buffer optimization (what you already tried), storing small arrays "locally". Really, only try if the above fails performance tests first. – Yakk - Adam Nevraumont Apr 03 '16 at 21:10
  • @Szabolcs As a theory, other than the TLS detail, my answer lines up with the last idea of 5gon12eder; maybe someone didn't like how similar they are. If the TLS detail was folded into 5gon12's answer, mine would be redundant; at the same time, I strongly suspect this solution is the only one that has a chance to both solve your portability and performance problems. – Yakk - Adam Nevraumont Apr 03 '16 at 22:38
  • Your idea with the FILO works well and closes the performance gap. I will accept the answer once I finalized the implementation (maybe tomorrow). – Szabolcs Apr 04 '16 at 09:18

I think you have already enumerated most options in your question and the comments.

  • Use std::vector. This is the most obvious, most hassle-free but maybe also the slowest solution.
  • Use platform-specific extensions on those platforms that provide them. For example, GCC supports variable-length arrays in C++ as an extension. POSIX specifies alloca which is widely supported to allocate memory on the stack. Even Microsoft Windows provides _malloca, as a quick web search told me.

In order to avoid maintenance nightmares, you'll really want to encapsulate these platform dependencies into an abstract interface that automatically and transparently chooses the appropriate mechanism for the current platform. Implementing this for all platforms will be a bit of work, but if this single feature accounts for the 3× speed difference you're reporting, it might be worth it. As a fallback for unknown platforms, I'd keep std::vector in reserve as a last resort. It is better to run slowly but correctly than to behave erratically or not run at all.

  • Build your own variable-sized array type that implements a “small array” optimization embedded as a buffer inside the object itself as you have shown in your question. I'll just note that I'd rather try using a union of a std::array and a std::vector instead of rolling my own container.

    Once you have a custom type in place, you can do interesting profiling such as maintaining a global hash table of all occurrences of this type (by source-code location) and recording each allocation size during a stress test of your program. You can then dump the hash table at program exit and plot the distributions in allocation sizes for the individual arrays. This might help you to fine-tune the amount of storage to reserve for each array individually on the stack.

  • Use a std::vector with a custom allocator. At program startup, allocate a few megabytes of memory and give it to a simple stack allocator. For a stack allocator, allocation is just comparing and adding two integers, and deallocation is simply a subtraction. I doubt that the compiler-generated stack allocation can be much faster. Your “array stack” would then pulsate in step with your “program stack”. This design would also have the advantage that accidental buffer overruns – while still invoking undefined behavior, trashing random data and all that bad stuff – wouldn't corrupt the program stack (return addresses) as easily as they would with native VLAs.

    Custom allocators in C++ are a somewhat dirty business, but some people do report using them successfully. (I don't have much experience with them myself.) You might want to start by looking at cppreference. Alisdair Meredith, who is one of those people promoting the usage of custom allocators, gave a double-session talk at CppCon'14 titled “Making Allocators Work” (part 1, part 2) that you might find interesting as well. If the std::allocator interface is too awkward for you, implementing your own variable-sized (as opposed to dynamically-sized) array class with your own allocator should be doable as well.
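The stack-allocator idea in the last bullet could be sketched roughly as follows. This is a single-threaded toy with made-up names (`vec_arena`, `stack_allocator`), no thread safety, and it assumes each vector performs exactly one allocation (i.e. is constructed at its final size):

```cpp
#include <cstddef>
#include <new>
#include <vector>

// A fixed buffer handed out in stack (FILO) order.
struct vec_arena {
    static constexpr std::size_t size = 1 << 20;  // "a few megabytes", scaled down
    alignas(16) char buf[size];
    std::size_t top = 0;
};
inline vec_arena &the_arena() { static vec_arena a; return a; }

template <typename T>
struct stack_allocator {
    using value_type = T;
    stack_allocator() = default;
    template <typename U> stack_allocator(const stack_allocator<U> &) {}

    T *allocate(std::size_t n) {
        vec_arena &a = the_arena();
        std::size_t start = (a.top + alignof(T) - 1) & ~(alignof(T) - 1);  // align for T
        if (start + n * sizeof(T) > vec_arena::size) throw std::bad_alloc();
        a.top = start + n * sizeof(T);
        return reinterpret_cast<T *>(a.buf + start);
    }
    void deallocate(T *p, std::size_t n) {
        vec_arena &a = the_arena();
        // Only the topmost block can be returned; anything else stays leaked.
        if (reinterpret_cast<char *>(p) + n * sizeof(T) == a.buf + a.top)
            a.top = static_cast<std::size_t>(reinterpret_cast<char *>(p) - a.buf);
    }
};
template <typename T, typename U>
bool operator==(const stack_allocator<T> &, const stack_allocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const stack_allocator<T> &, const stack_allocator<U> &) { return false; }

template <typename T>
using lad_vec = std::vector<T, stack_allocator<T>>;
```

Note that letting such a vector grow breaks the FILO pattern (the new block is allocated before the old one is freed), which is why constructing each vector at its final size matters here.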

5gon12eder

Regarding support for MSVC:

MSVC has _alloca which allocates stack space. It also has _malloca, which allocates stack space if there is enough free stack space and otherwise falls back to heap allocation; memory obtained from _malloca must be released with _freea.

You cannot take advantage of the VLA type system, so you would have to change your code to work with a pointer to the first element of such an array.

You may end up needing a macro which has different definitions depending on the platform, e.g. one that invokes _alloca or _malloca on MSVC, and on g++ or other compilers either calls alloca (if they support it) or declares a VLA and takes a pointer to it.
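Such a macro might look roughly like this (a sketch; `LAD_STACK_ALLOC` is a made-up name, and the header that declares `alloca` varies by platform):

```cpp
#if defined(_MSC_VER)
  #include <malloc.h>                 // MSVC: _alloca / _malloca
  #define LAD_STACK_ALLOC(bytes) _alloca(bytes)
#else
  #include <alloca.h>                 // glibc and OS X; some platforms use <stdlib.h>
  #define LAD_STACK_ALLOC(bytes) alloca(bytes)
#endif

// The macro must expand in the function whose stack frame should own the
// memory; hiding it inside a non-inline helper would free it on return.
// Example: int *p = (int *)LAD_STACK_ALLOC(n * sizeof(int));
```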


Consider investigating ways to rewrite the code without needing to allocate an unknown amount of stack. One option is to allocate a fixed-size buffer that is the maximum you will ever need. (If that would cause a stack overflow, it means your code is buggy anyway.)

M.M
  • I'd be worried about alloca using the wrong stack frame if it's not being called explicitly from the same function where the object is declared. – Random832 Apr 04 '16 at 06:38
  • @Random832 not sure what you are talking about, I am suggesting replacing VLA declarations with alloca as a possible option – M.M Apr 04 '16 at 06:44
  • I think I got confused and thought you were talking about hiding this behavior behind a class. – Random832 Apr 04 '16 at 06:48
  • @Random832: Both `_alloca()` and `alloca()` do the right thing if the call for the function in which they're used is properly inlined. You can make sure this happens using `__forceinline` and `__attribute__((always_inline))`. I use this extensively in C90 code (which also doesn't have VLAs). – alecov Apr 04 '16 at 14:02