15

After profiling my backpropagation algorithm, I have learnt that it is responsible for 60% of my computation time. Before I start looking at parallel alternatives, I would like to see if there is anything further I can do.

The activate(const double input[]) function is profiled to take only ~5% of the time. The gradient(const double input) function is implemented as follows:

inline double gradient(const double input) { return (1 - (input * input)); }

The training function in question:

void train(const vector<double>& data, const vector<double>& desired, const double learn_rate, const double momentum) {
        this->activate(data);
        this->calculate_error(desired);

        // adjust weights for layers
        const auto n_layers = this->config.size();
        const auto adjustment = (1 - momentum) * learn_rate;

        for (size_t i = 1; i < n_layers; ++i) {
            const auto& inputs = i - 1 > 0 ? this->outputs[i - 1] : data;
            const auto n_inputs = this->config[i - 1];
            const auto n_neurons = this->config[i];

            for (auto j = 0; j < n_neurons; ++j) {
                const auto adjusted_error = adjustment * this->errors[i][j];

                for (auto k = 0; k < n_inputs; ++k) {
                    const auto delta = adjusted_error * inputs[k] + (momentum * this->deltas[i][j][k]);

                    this->deltas[i][j][k] = delta;
                    this->weights[i][j][k] += delta;
                }

                const auto delta = adjusted_error * this->bias + (momentum * this->deltas[i][j][n_inputs]);

                this->deltas[i][j][n_inputs] = delta;
                this->weights[i][j][n_inputs] += delta;
            }
        }
    }

This question may be better suited to https://codereview.stackexchange.com/.

hiddensunset4
  • 5,825
  • 3
  • 39
  • 61
  • what about the gradient() function? Also you might find better performance using iterators. – ColWhi Apr 19 '11 at 08:50
  • 2
    Have you tried profiling the various loops separately? I see two O(n^3) chunks. Maybe one of them is less important. (Also, try iterators, they might help) – Shiroko Apr 19 '11 at 09:02
  • If I understand that correctly, the vectors are the variables named outputs and inputs is that right? Don't std::vector array accessors incur function call overhead? – hookenz Apr 19 '11 at 09:03
  • `const double lower_layer_output = i > 0 ? outputs[lower_layer][k] : input[k]; // input layer semantics` may be it's possible to avoid condition checking by defining a pointer to `outputs[lower_layer]` or `input` somewhere in outer loop? – Andrey Sboev Apr 19 '11 at 09:06
  • @Matt H no overhead that wasn't completely optimized out by the compiler, and profiling shows this. You may be thinking of the differences in debug vector accessors. – hiddensunset4 Apr 19 '11 at 09:26
  • @Daniel: could you post the definition of the Backprop class, so that we can have a minimal compilable example ? It's difficult to optimize "a priori". – Matthieu M. Apr 19 '11 at 09:29
  • Updated post for you Matthieu, see bottom – hiddensunset4 Apr 19 '11 at 09:37
  • @Daniel: I agree with Andrey, remove the conditionals from the loops. Unroll the loop for at least the iteration where `i == 0`. It is likely that GCC will then be able to vectorize your code better. – wilx Apr 19 '11 at 11:34
  • You're asking for guesses. Everyone's giving you guesses. [There's no need to guess.](http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024) [Here's an example.](http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773) – Mike Dunlavey Apr 19 '11 at 14:37
  • @Shiroko I must be thankful for this suggestion; at first it gave me nothing because GCC basically inlines everything, but after finding the GCC attribute to disable inlining for particular functions, I got some useful information. – hiddensunset4 Apr 21 '11 at 05:31

4 Answers

7

You can't avoid an O(n^2) algorithm if you want to train/use an NN. But it is perfectly suited for vector arithmetic. For example, with clever use of SSE or AVX you could process the neurons in chunks of 4 or 8 and use a multiply-add instead of two separate instructions.

If you use a modern compiler, carefully reformulate the algorithm, and pass the right switches, you might even get the compiler to autovectorize the loops for you, but your mileage may vary.

For gcc, autovectorization is activated using -O3 or -ftree-vectorize. You need a vector-capable CPU of course, something like -march=core2 -msse4.1 or similar, depending on the target CPU. If you use -ftree-vectorizer-verbose=2 you get detailed explanations of why and where loops were not vectorized. Have a look at http://gcc.gnu.org/projects/tree-ssa/vectorization.html.

Better still, of course, is using the compiler intrinsics directly.
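
For the inner weight update, a minimal SSE2 sketch might look like the following. This is not your code: it assumes the weights, deltas and inputs of one neuron are contiguous double arrays, assumes n_inputs is even, and update_row_sse2 is a made-up name.

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// delta = adjusted_error * input + momentum * old_delta, two doubles per iteration
void update_row_sse2(double* weights, double* deltas, const double* inputs,
                     std::size_t n_inputs, double adjusted_error, double momentum) {
    const __m128d err = _mm_set1_pd(adjusted_error);
    const __m128d mom = _mm_set1_pd(momentum);
    for (std::size_t k = 0; k < n_inputs; k += 2) {
        const __m128d in    = _mm_loadu_pd(inputs + k);
        const __m128d old_d = _mm_loadu_pd(deltas + k);
        const __m128d delta = _mm_add_pd(_mm_mul_pd(err, in), _mm_mul_pd(mom, old_d));
        _mm_storeu_pd(deltas + k, delta);
        _mm_storeu_pd(weights + k, _mm_add_pd(_mm_loadu_pd(weights + k), delta));
    }
}

Note this is still a separate multiply and add; a true fused multiply-add needs FMA-capable hardware, but processing two lanes at a time is often a win on its own.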

Gunther Piez
  • 29,760
  • 6
  • 71
  • 103
  • Great suggestion for that -ftree-vectorizer-verbose=2, it showed me which loops it was unable to vectorize, bringing the total up to 2 from 0 when moving conditionals out. There was no visible difference, but I definitely learnt something :) This answer pointed me to a few articles after googling, which taught me a lot about loop vectorization, so I'm accepting this as the most helpful answer. – hiddensunset4 Apr 23 '11 at 06:34
2

You want to eliminate the conditional from inside your loop here:

const double lower_layer_output = i > 0 ? outputs[lower_layer][k] : input[k]; // input layer semantics

You can eliminate this condition by calculating the zeroth iteration (the special case of i == 0) separately, before the loop.
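
An untested sketch of the idea against the posted train(): pull the j/k loops into a hypothetical member function (update_layer is a made-up name), pass in the input source, and peel the first layer so the hot loops never branch on the layer index.

// Hypothetical member function: the j/k weight-update loops from train(),
// with the input source passed in so the body never branches on the layer index.
void update_layer(size_t i, const vector<double>& inputs, const double adjustment, const double momentum) {
    const auto n_inputs = this->config[i - 1];
    const auto n_neurons = this->config[i];

    for (size_t j = 0; j < n_neurons; ++j) {
        const auto adjusted_error = adjustment * this->errors[i][j];

        for (size_t k = 0; k < n_inputs; ++k) {
            const auto delta = adjusted_error * inputs[k] + (momentum * this->deltas[i][j][k]);
            this->deltas[i][j][k] = delta;
            this->weights[i][j][k] += delta;
        }

        // bias weight handled exactly as in the original
    }
}

// In train(): the first layer reads `data`, every later layer reads the previous outputs.
update_layer(1, data, adjustment, momentum);
for (size_t i = 2; i < n_layers; ++i)
    update_layer(i, this->outputs[i - 1], adjustment, momentum);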

        deltas[i][j][k] = delta;
        weights[i][j][k] += delta;

You mention using std::vector, so this is a vector of vectors of vectors? Your data is not going to be contiguous (except in the sense that each innermost vector is contiguous). Consider using C-style arrays.
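
One way to get that contiguity while keeping standard containers (a sketch only; LayerWeights is a made-up type, not something from your class) is to store one layer's weight matrix in a single flat buffer and index it manually:

#include <vector>
#include <cstddef>

// One layer's weights stored row-major: element (j, k) sits at w[j * n_inputs + k].
struct LayerWeights {
    std::size_t n_neurons, n_inputs;
    std::vector<double> w;  // n_neurons * n_inputs contiguous doubles

    LayerWeights(std::size_t neurons, std::size_t inputs)
        : n_neurons(neurons), n_inputs(inputs), w(neurons * inputs, 0.0) {}

    double&       at(std::size_t j, std::size_t k)       { return w[j * n_inputs + k]; }
    const double& at(std::size_t j, std::size_t k) const { return w[j * n_inputs + k]; }
};

The whole layer is then one cache-friendly stretch of memory and the inner loop avoids one level of pointer indirection.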

How big are those dimensions? There may be some caching considerations if very large. E.g. you don't want that last subscript [k] to flush the L1 cache. Sometimes breaking the loop to process a smaller range of k indexes at a time can help (strip mining).
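
A rough strip-mining sketch against the names in the posted train() (untested; BLOCK is just a tuning knob): process the k range in blocks and reuse each block of inputs across all neurons while it is still in cache.

#include <algorithm>  // std::min

const size_t BLOCK = 64;  // tune against your cache size
for (size_t k0 = 0; k0 < n_inputs; k0 += BLOCK) {
    const size_t k_end = std::min<size_t>(k0 + BLOCK, n_inputs);

    for (size_t j = 0; j < n_neurons; ++j) {
        const double adjusted_error = adjustment * errors[i][j];

        for (size_t k = k0; k < k_end; ++k) {
            const double delta = adjusted_error * inputs[k] + momentum * deltas[i][j][k];
            deltas[i][j][k] = delta;
            weights[i][j][k] += delta;
        }
    }
}

Each (j, k) pair is still updated exactly once, so the result is unchanged; only the traversal order differs.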

You can also experiment with unrolling your inner loops a little, e.g. try doing 4 or 8 operations inside the loop. Increment by 4/8 respectively and handle any remainder in another loop. The compiler may be doing that already.
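
A sketch of unrolling the inner update by 4, again purely illustrative and using the names from train(); the remainder loop picks up whatever is left when n_inputs is not a multiple of 4.

size_t k = 0;
for (; k + 4 <= n_inputs; k += 4) {
    // four independent updates per iteration; write them out by hand
    // or let the compiler flatten this fixed-count inner loop
    for (size_t u = 0; u < 4; ++u) {
        const double delta = adjusted_error * inputs[k + u] + momentum * deltas[i][j][k + u];
        deltas[i][j][k + u] = delta;
        weights[i][j][k + u] += delta;
    }
}
for (; k < n_inputs; ++k) {  // remainder
    const double delta = adjusted_error * inputs[k] + momentum * deltas[i][j][k];
    deltas[i][j][k] = delta;
    weights[i][j][k] += delta;
}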

As others have mentioned, using SIMD (SSE/AVX) is probably where you can find the most gain. You can either use compiler intrinsics (the link is to Visual Studio, but gcc supports the same syntax) or write in assembly (inlined or otherwise). As you mentioned, scaling across multiple cores is another direction. OpenMP can help you do that without a lot of pain.
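
For instance, a minimal OpenMP sketch over the neuron loop (untested; names mirror train(), and each j iteration touches only its own deltas[i][j] and weights[i][j] rows, so the iterations are independent):

// compile with -fopenmp (gcc)
#pragma omp parallel for
for (int j = 0; j < (int)n_neurons; ++j) {
    const double adjusted_error = adjustment * errors[i][j];
    for (size_t k = 0; k < n_inputs; ++k) {
        const double delta = adjusted_error * inputs[k] + momentum * deltas[i][j][k];
        deltas[i][j][k] = delta;
        weights[i][j][k] += delta;
    }
}

Whether this pays off depends on the layer sizes; for small layers the threading overhead can easily dominate.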

Sometimes it is useful to generate an annotated assembly listing from your code to try and see where the compiler isn't doing such a great job.

This is an excellent general resource about optimization.

Guy Sirton
  • 8,331
  • 2
  • 26
  • 36
  • Unfortunately, moving all the conditionals out of the loops made no difference in performance; they were trivial enough that it is probably all optimized by the compiler. The vector sizes aren't large, so they should fit within the L1 cache easily. There is no performance gain from using C-style arrays compared to dynamically resizable ones; I already made the switch from them earlier with zero performance cost. – hiddensunset4 Apr 23 '11 at 06:12
  • @Daniel it's probably because everything before the [k] is treated as invariant and moved out of the loop including the condition. I'll add a few more thoughts to my answer though most of the points have been covered by others. – Guy Sirton Apr 23 '11 at 16:58
1

I'm not sure if the compiler can optimize it in your case, but hoisting inverse_momentum * (learn_rate * errors[i][j]) into a variable outside the "k" loop in the lower loops might decrease the load on the CPU.
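
Roughly like this (a fragment only, using the names from the question; the weight update is omitted):

const double adjusted_error = inverse_momentum * (learn_rate * errors[i][j]); // invariant in k
for (size_t k = 0; k < n_inputs; ++k)
    deltas[i][j][k] = adjusted_error * inputs[k] + momentum * deltas[i][j][k];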

BTW, you are profiling a release binary and not a debug one, aren't you?

  • 1
    normally it would, if it can establish it's possible (this is called Loop Invariant optimization). It does not hurt to move it out though, makes it easier on the reader :) – Matthieu M. Apr 19 '11 at 11:11
1

I'm not fond of valarray, but I have a hunch that there is quite some opportunity for it here.

Blitz++ seems to have a better aura around the web, but I don't know it :)

I was starting to work on a PoC myself, but there are too many missing bits of code:

void activate(const double input[]) { /* ??? */ }
const unsigned int n_layers_ns;
const unsigned int n_layers;
const unsigned int output_layer_s;
const unsigned int output_layer;

T /*double?*/ bias = 1 /*.0f?*/;

const unsigned int config[];
double outputs[][];
double errors [][];
double weights[][][];
double deltas [][][];

Now it follows logically from the code that at least the first (rank-0) indices to the arrays are defined by the 4 missing constants. If these constants can be known at compile time, they would make great value template parameters:

template <unsigned int n_layers_ns = 2,
          unsigned int n_layers = 3>
struct Backprop {
    void train(const double input[], const double desired[], const double learn_rate, const double momentum);

    void activate(const double input[]) { }
    enum _statically_known
    {
        output_layer = n_layers_ns - 1,
        output_layer_s = n_layers - 1, // output_layer with input layer semantics (for config use only)

        n_hidden_layers = output_layer - 1,
    };

    static const double bias = 1.0f;

    const unsigned int config[];
    double outputs[3][50];      // if these dimensions could be statically known, 
    double errors[3][50];       //        slap them in valarrays and 
    double weights[3][50][50];  //        see what the compiler does with that!
    double deltas[3][50][50];   // 
};

template <unsigned int n_layers_ns,
          unsigned int n_layers>
void Backprop<n_layers_ns, n_layers>::train(const double input[], const double desired[], const double learn_rate, const double momentum) {
    activate(input);

    // calculated constants
    const double inverse_momentum = 1.0 - momentum;
    const unsigned int n_outputs = config[output_layer_s];

    // calculate error for output layer
    const double *output_layer_input = output_layer > 0 ? outputs[output_layer] : input; // input layer semantics
    for (unsigned int j = 0; j < n_outputs; ++j) {
        //errors[output_layer][j] = f'(outputs[output_layer][j]) * (desired[j] - outputs[output_layer][j]);
        errors[output_layer][j] = gradient(output_layer_input[j]) * (desired[j] - output_layer_input[j]);
    }
    [... snip ...]

Notice how I reordered the statements in the first loop a bit, to make the loop trivial. Now, I can imagine those last lines becoming

// calculate error for output layer
const std::valarray<double> output_layer_input = output_layer > 0 ? outputs[output_layer] : input; // input layer semantics
errors[output_layer] = output_layer_input.apply(&gradient) * (desired - output_layer_input);

This will require the proper (g)slices to be set up for the inputs. I cannot work out how these would have to be dimensioned. The crux of the matter is that as long as these slice dimensions can be statically determined by the compiler, you will have the potential for significant time savings, since the compiler can optimize these into vectorized operations on either the FPU stack or using the SSE4 instruction set. I suppose you would declare your output something like this:

std::valarray<double> rawoutput(/*capacity?*/);
std::valarray<double> outputs = rawoutput[std::slice(0, n_outputs, n_layers)]; // guesswork

(I expect weights and deltas would have to become gslices because AFAICT they are 3-dimensional)
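
For what it's worth, here is a guess at how one of the 3-dimensional arrays could be sliced with std::gslice. The dimensions are made up (3 layers of at most 50 x 50) and none of this has been checked against the actual class:

#include <valarray>
#include <cstddef>

const std::size_t L = 3, N = 50, M = 50;      // layers x neurons x inputs (guessed)
std::valarray<double> weights(L * N * M);     // flat, row-major storage

// gslice selecting the whole N x M weight matrix of layer i
std::gslice layer(std::size_t i) {
    const std::size_t ext[] = { N, M };       // extents of the 2D block
    const std::size_t str[] = { M, 1 };       // strides: next row, next element
    return std::gslice(i * N * M,
                       std::valarray<std::size_t>(ext, 2),
                       std::valarray<std::size_t>(str, 2));
}

// e.g. scale every weight of layer 1 in one expression:
// weights[layer(1)] *= std::valarray<double>(0.5, N * M);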

Miscellanea (alignment, layout)

I realized that there probably won't be much gain if the arrays' ranks (dimensions) aren't optimally ordered (e.g. the first rank in a valarray is relatively small, say 8). This could hamper vectorization because the participating elements could be scattered in memory, whereas I suppose optimization requires them to be adjacent.

In this light it is important to realize that the 'optimal' ordering of the ranks is ultimately dependent on the access patterns alone (so, profile and inspect again).

Also, the opportunity for optimization might be hampered by unfortunate memory alignment [1]. In that light, you might want to switch the order of (val)array ranks and round rank dimensions to the nearest power of 2 (or, more practically, multiples of 32 bytes).

If all this actually makes a big impact (profile / inspect the generated code first!), I would imagine that

  • Blitz++ or Boost might contain helpers (allocators?) to enforce alignment
  • your compiler will have attributes (of the align and/or restrict kind) to tell it that it can assume these alignments for input pointers

Unrelated:

If the order of execution is not crucial (i.e. the relative orders of magnitude of factors are very similar), instead of

inverse_momentum * (learn_rate * ???) 

you could take

(inverse_momentum * learn_rate) * ???

and precalculate the first subproduct. However, from the fact that it is explicitly ordered this way, I'm guessing that this would lead to more noise.

[1] disclaimer: I haven't actually done any analysis on that, I'm just throwing it out there so you don't miss the 'though joint' (how's that for Engrish)

sehe
  • 374,641
  • 47
  • 450
  • 633