
I am trying to teach myself how to write faster code (code that uses fewer instructions). I want to create an artificial neural network (ANN). If you know nothing about ANNs, you may still be able to help, as my question is more about writing fast code than about ANNs. Basically, I am going to have a big array of doubles that I need to perform lots of math on. I could allocate my arrays like this:

class Network {
    double *weights;
    double *outputs;

    public:
    // default-construct with null pointers so the destructor is
    // safe even when the sizing constructor never ran
    Network() : weights(nullptr), outputs(nullptr)
    {
    }
    Network(int * sizes, int numofLayers)
    {
        int sum = 0;
        int neuron_count = 0;
        // weights holds one entry per connection between adjacent
        // layers; outputs holds one entry per neuron
        for(int i = 0; i < numofLayers-1; i++)
        {
            neuron_count += sizes[i];
            sum = sum + sizes[i]*sizes[i+1];
        }
        neuron_count += sizes[numofLayers-1];
        weights = new double[sum];
        outputs = new double[neuron_count];
    }
    ~Network()
    {
        delete[] weights;
        delete[] outputs;
    }
};

However, I dislike this because I use "new", and I know I will probably open myself up to a bunch of memory management problems later on. I know that stack allocation is faster and that I shouldn't use dynamic memory if I can help it, based on this excerpt:

"Dynamic memory allocation is done with the operators new and delete or with the functions malloc and free. These operators and functions consume a significant amount of time. A part of memory called the heap is reserved for dynamic allocation. The heap can easily become fragmented when objects of different sizes are allocated and deallocated in random order. The heap manager can spend a lot of time cleaning up spaces that are no longer used and searching for vacant spaces. This is called garbage collection. Objects that are allocated in sequence are not necessarily stored sequentially in memory. They may be scattered around at different places when the heap has become fragmented. This makes data caching inefficient"

– Agner Fog, *Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms*

However, the arrays weights and outputs will be used by many functions I'd like to create in the class Network; if they were local and got deallocated, that wouldn't work. I feel stuck between using the new keyword and making one gigantic function for pretty much the entire neural network.

In normal circumstances I would say readability is more important for maintaining code, but I am not worried about that here, since this is more about learning to write fast code. If people who write time-critical algorithms use big functions to make things fastest, that's fine; I'd just like to know.

On the other hand, if my arrays are allocated only once and used throughout the whole program, is it smarter to use heap allocation, since that problematic overhead happens only once? Then I could focus on using intrinsics or something similar in the math-heavy code. Are there any other downsides? For example, if I access the heap a lot so that that memory is pulled into the cache, is this more expensive than accessing the stack a lot so that the stack is pulled into the cache (even if the heap stays constant)? Do people who write very fast code always avoid new? If so, what alternatives do I have for organizing my program this way while still keeping my arrays an arbitrary size specified by the user?

– ambrose158
  • You don't need to use `new`. Just use a `std::vector`. – πάντα ῥεῖ Jun 11 '17 at 18:16
  • Work on your design. Optimize at the design level and the optimizations will flow to the implementation. Remove unnecessary loops. Design for data cache optimization. Design for execution cache optimization. – Thomas Matthews Jun 11 '17 at 18:21
  • @YePhIcK whoops, sorry about that. I pasted that code in and didn't realize it was the wrong one. That and the other bugs shouldn't be there anymore – ambrose158 Jun 11 '17 at 18:22
  • Faster code != less instructions. Often it does, but better organization of a data structure, even if that requires more code, will provide amazing performance gains. – user4581301 Jun 11 '17 at 18:22
  • Also note that objects on "the stack" (with automatic storage duration) should be rather small. Stack space is rather limited ... – Daniel Jour Jun 11 '17 at 18:24
  • Recommend a read through [What is The Rule of Three?](https://stackoverflow.com/questions/4172722/what-is-the-rule-of-three) because your class violates it. The above recommendation of `std::vector` instantly resolves this problem. – user4581301 Jun 11 '17 at 18:25
  • You don't need nor want that manual memory management. At *least* use smart pointers. Better yet, just use `std::vector` or `std::array`. – Jesper Juhl Jun 11 '17 at 18:25
  • In embedded systems, the 'major' (big? significant?) data structures are built once and last 'forever'. As such, the construction of a std::vector, with a reserve to the needed size is part of system start up, and insignificant during operation. If your two major vectors (weights and outputs) last the lifetime of your code, fragmentation does not occur. – 2785528 Jun 11 '17 at 18:25
  • As your sample code only calls new twice compared to the other operations, it's probably not a big concern. However, before you try to optimise, measure; and before that, good algorithm design. – Richard Critten Jun 11 '17 at 18:25
  • This question seems to boil down to "How do I make my program fast?", which I doubt is a proper scope – Passer By Jun 11 '17 at 18:27
  • So 1. use std::vector to avoid memory management problems. 2. Since the vectors last throughout the code this will not cause problems. 3. focus on design level. – ambrose158 Jun 11 '17 at 18:31
  • If anyone knows a bunch about why std::vector is better and feels like sharing, please do, but it's no big deal; I can also look it up. Thank you! – ambrose158 Jun 11 '17 at 18:32
  • Step one in making your code fast is to enable optimization when you compile it. Step two is using efficient algorithms (1 & 2 can be swapped). Step three is profiling your code to identify (relevant) hot-spots and then THINK about how those could be improved (preferably algorithmically, *sometimes* by micro-optimization of the code in question; but don't go there until you've exhausted other options). – Jesper Juhl Jun 11 '17 at 18:34
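
The comments converge on `std::vector`. As a rough sketch (my illustration, not part of the original thread), here is the class above with vectors owning the memory; the Rule of Three problem disappears because destruction, copying and assignment are handled by the vector, and the storage is still one contiguous heap block per array, allocated once:

#include <vector>

class Network {
    std::vector<double> weights;
    std::vector<double> outputs;

public:
    Network() = default;
    Network(const int *sizes, int numofLayers)
    {
        int sum = 0;
        int neuron_count = 0;
        for (int i = 0; i < numofLayers - 1; i++) {
            neuron_count += sizes[i];
            sum += sizes[i] * sizes[i + 1];
        }
        neuron_count += sizes[numofLayers - 1];
        // still one contiguous heap allocation each, done exactly once
        weights.resize(sum);
        outputs.resize(neuron_count);
    }
    // no destructor needed: the vectors clean up after themselves
};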

1 Answer


There is nothing inherently bad or slow about heap memory. However, there are cases where you have to be careful about how you use it. One of the main issues that really hurts performance is having a lot of small objects scattered all over the heap (each allocation may place a new object almost anywhere in memory). This occurs, for example, in this case:

for (int i = 0; i < some_size; ++i)
    some_array[i] = new MySmallObject;  // each object may land anywhere on the heap

The problem is that when you then use some_array for many operations over a long time, the cache hit rate can be very low (cache behaviour is one of the main parameters you can play with on modern hardware when struggling with performance), which ultimately shows up as a performance drop. Every access some_array[i] may touch a different, scattered memory location that does not fit the cache's prefetching patterns.
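
The usual fix is to store the objects themselves contiguously instead of pointers to them. A minimal sketch (mine, with MySmallObject and use() as placeholder names):

#include <cstddef>
#include <vector>

struct MySmallObject { double x; };

void use(MySmallObject &o) { o.x += 1.0; }  // stand-in for real work

int main()
{
    const std::size_t some_size = 1000;
    // one allocation, objects laid out back to back in memory
    std::vector<MySmallObject> some_array(some_size);
    for (std::size_t i = 0; i < some_size; ++i)
        use(some_array[i]);  // sequential, stride-1 accesses: cache- and prefetcher-friendly
}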

Another problem is allocating and deallocating memory very frequently over a long stretch of your application's lifetime. That hurts performance as well, because dynamic memory allocation is not a trivial operation; this stems from the nature of dynamic memory. The allocator does not know in advance what data you will allocate or when, so, simply speaking, it maintains a table of memory blocks recording which are busy and which are free. Now imagine allocating lots of small objects frequently while erasing some of the old ones: the bookkeeping gets out of hand, and, depending on the allocator's implementation, the memory becomes fragmented; how quickly that happens depends on how often you allocate and deallocate. This, of course, also hurts performance.

Those are the two main performance killers related to dynamic memory allocation. A legitimate question arises: what to do about it? Many problems can be solved by redesigning the program and/or the algorithms. Where that is not possible, the remaining option is custom allocators. An allocator is a type that provides the operations to allocate and deallocate memory. Allocators appear as template and constructor arguments all over the standard containers, such as std::vector, std::string, etc. Another option is overloading the global or per-class operator new. Once everything is custom, you can of course use your own scheme for how your objects obtain memory, which may simply be a singleton memory pool.
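
To make the allocator option concrete, here is a minimal C++11-style allocator skeleton (my sketch, not part of the original answer) that simply routes through malloc; the allocate/deallocate bodies are where a pool or arena would plug in:

#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

template <typename T>
struct MallocAllocator {
    using value_type = T;

    MallocAllocator() = default;
    template <typename U>
    MallocAllocator(const MallocAllocator<U> &) {}

    T *allocate(std::size_t n)
    {
        void *p = std::malloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T *>(p);
    }
    void deallocate(T *p, std::size_t) { std::free(p); }
};

template <typename T, typename U>
bool operator==(const MallocAllocator<T> &, const MallocAllocator<U> &) { return true; }
template <typename T, typename U>
bool operator!=(const MallocAllocator<T> &, const MallocAllocator<U> &) { return false; }

// usage: a vector that obtains its storage through the custom allocator
// std::vector<double, MallocAllocator<double>> v(100);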

Manual memory management can be a very complex task. One trivial solution is to use malloc to obtain one large memory block and treat it as a kind of stack, where you just move a pointer up and down; when the pool runs out of memory, you abort or fall back to OS dynamic memory. I say malloc because it is dangerous to reserve large buffers on the real call stack: stack space is quite limited, you have no control over it, and with custom allocators it is very easy to blow past the stack size on a modern OS. Static storage is another option, but it is rather inflexible: you can never extend the pool, whereas with malloc you can keep a trivial list of large memory blocks and add a new one when the previous is exhausted.
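
A rough sketch of that stack-like pool (mine; alignment handling is minimal, and growth and thread safety are deliberately left out):

#include <cstddef>
#include <cstdint>
#include <cstdlib>

class Arena {
    std::uint8_t *base, *top, *end;

public:
    explicit Arena(std::size_t bytes)
        : base(static_cast<std::uint8_t *>(std::malloc(bytes))),
          top(base), end(base + bytes) {}
    ~Arena() { std::free(base); }

    void *allocate(std::size_t bytes)
    {
        // round the size up so the next allocation stays suitably aligned
        const std::size_t a = alignof(std::max_align_t);
        bytes = (bytes + a - 1) & ~(a - 1);
        if (top + bytes > end) return nullptr;  // pool exhausted: abort or fall back to the OS
        void *p = top;
        top += bytes;                           // "push": just move the pointer forward
        return p;
    }
    void reset() { top = base; }                // "pop" everything at once
};

Allocation here is just a pointer bump, a handful of instructions, which is why per-frame scratch buffers in games and similar systems use this pattern.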

That is all on memory. Of course there is much more to say, but those are the main points I wanted to highlight. The other big performance wins lie in using your processor's full capacity: loading all the cores and employing vectorization.
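
On the vectorization point, a tight loop over contiguous doubles is the kind of code compilers can turn into SIMD instructions (a sketch of mine; note that GCC and Clang typically vectorize floating-point reductions only with flags such as -O3 -ffast-math):

// weighted sum over one layer: contiguous, stride-1 accesses
double weighted_sum(const double *weights, const double *inputs, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += weights[i] * inputs[i];
    return acc;
}

Keeping the weights in one flat array, as in the question's code, is exactly what makes this possible.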

Sorry that I cannot give a direct answer to your question; there is too much to take into account for this kind of problem. It depends on your case: the domain, the requirements, the hardware, the task at hand, and so on.

– Yuki