9

I am trying to make a Java port of a simple feed-forward neural network.
This obviously involves lots of numeric calculations, so I am trying to optimize my central loop as much as possible. The results should be correct within the limits of the float data type.

My current code looks as follows (error handling & initialization removed):

/**
 * Simple implementation of a feedforward neural network. The network supports
 * including a bias neuron with a constant output of 1.0 and weighted synapses
 * to hidden and output layers.
 * 
 * @author Martin Wiboe
 */
public class FeedForwardNetwork {
private final int outputNeurons;    // No of neurons in output layer
private final int inputNeurons;     // No of neurons in input layer
private int largestLayerNeurons;    // No of neurons in largest layer
private final int numberLayers;     // No of layers
private final int[] neuronCounts;   // Neuron count in each layer, 0 is input
                                // layer.
private final float[][][] fWeights; // Weights between neurons.
                                    // fWeight[fromLayer][fromNeuron][toNeuron]
                                    // is the weight from fromNeuron in
                                    // fromLayer to toNeuron in layer
                                    // fromLayer+1.
private float[][] neuronOutput;     // Temporary storage of output from previous layer


public float[] compute(float[] input) {
    // Copy input values to input layer output
    for (int i = 0; i < inputNeurons; i++) {
        neuronOutput[0][i] = input[i];
    }

    // Loop through layers
    for (int layer = 1; layer < numberLayers; layer++) {

        // Loop over neurons in the layer and determine weighted input sum
        for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) {
            // Bias neuron is the last neuron in the previous layer
            int biasNeuron = neuronCounts[layer - 1];

            // Get weighted input from bias neuron - output is always 1.0
            float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];

            // Get weighted inputs from rest of neurons in previous layer
            for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
                activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][inputNeuron][neuron];
            }

            // Store neuron output for next round of computation
            neuronOutput[layer][neuron] = sigmoid(activation);
        }
    }

    // Return output from network = output from last layer
    float[] result = new float[outputNeurons];
    for (int i = 0; i < outputNeurons; i++)
        result[i] = neuronOutput[numberLayers - 1][i];

    return result;
}

private final static float sigmoid(final float input) {
    return (float) (1.0F / (1.0F + Math.exp(-1.0F * input)));
}
}

I am running the JVM with the -server option, and as of now my code is between 25% and 50% slower than similar C code. What can I do to improve this situation?

Thank you,

Martin Wiboe

Edit #1: After seeing the vast amount of responses, I should probably clarify the numbers in our scenario. During a typical run, the method will be called about 50,000 times with different inputs. A typical network would have numberLayers = 3 layers with 190, 2 and 1 neurons, respectively. The innermost loop will therefore have about 2*191+3 = 385 iterations (counting the added bias neuron in layers 0 and 1).

Edit #2: After implementing the various suggestions in this thread, our implementation is practically as fast as the C version (within ~2%). Thanks for all the help! All of the suggestions have been helpful, but since I can only mark one answer as the correct one, I will give it to @Durandal for both suggesting array optimizations and being the only one to precalculate the for loop header.

Martin Wiboe
  • 2,119
  • 2
  • 28
  • 50
  • 5
    Have you profiled it? It would be interesting to know where it's spending most of its time. – Brendan Long Jun 08 '10 at 00:04
  • Agreed on the profiling. don't eyeball it and guess what needs improved. – Donnie Jun 08 '10 at 00:12
    Is such code easily parallelizable? If so, writing a multithreaded version of it will *own* a mono-threaded version big times. I've been there, rewriting a correctly multithreaded Quicksort in Java. It's a joy to watch on a 16-cores machine: http://stackoverflow.com/questions/2210185 (and it crushes the default Java API sorting algos **big** times). Besides that, I see a few micro-optimization but I don't know enough about neural network to be of much help. (btw lately it has become hard to buy mono-CPU machines, for example I don't know if Apple still sells mono CPU Macs) – SyntaxT3rr0r Jun 08 '10 at 00:18
  • Are you sure it isn't just the JVM warming up? – CurtainDog Jun 08 '10 at 00:40
  • @CurtainDog JVM is warmed up when I get the best measurements (25%-50% slower than C). @Webinator Good suggestion (impressive algo!). I can parallelize the overall task and run this method concurrently though, so I'm not sure I see the benefit of splitting compute(). @Donnie and Brendan Profiling is definitely the way to go, I just didn't get any meaningful results from jvisualvm. Will try another profiler tomorrow. – Martin Wiboe Jun 08 '10 at 00:48

8 Answers

8

Some tips.

  • in your innermost loop, think about how you are traversing your CPU cache and re-arrange your matrix so that the innermost array is accessed sequentially. This will result in you accessing your cache in order rather than jumping all over the place. A cache hit can be two orders of magnitude faster than a cache miss. E.g. restructure fWeights so it is accessed as

activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][neuron][inputNeuron];

  • don't perform work inside the loop (every time) which can be done outside the loop (once). Don't perform the [layer - 1] lookup every time when you can place it in a local variable. Your IDE should be able to refactor this easily. (A combined sketch of these suggestions follows this list.)

  • multi-dimensional arrays in Java are not as efficient as they are in C. They are actually multiple layers of single dimensional arrays. You can restructure the code so you're only using a single dimensional array.

  • don't return a new array when you can pass the result array as an argument. (Saves creating a new object on each call).

  • rather than performing layer - 1 all over the place, why not use a variable layer1 = layer - 1 and use layer1 + 1 instead of layer.
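Putting the first points together, here is a rough sketch (illustrative only, not tested; local variable names are made up) of what the inner loops might look like with fWeights transposed to fWeights[fromLayer][toNeuron][fromNeuron], the [layer - 1] lookups hoisted into locals, and the loop bound precalculated:

// Assumes fWeights has been reorganized so that
// fWeights[fromLayer][toNeuron][fromNeuron] holds the weight from
// fromNeuron in fromLayer to toNeuron in fromLayer + 1, so the
// innermost loop walks a single contiguous float[].
for (int layer = 1; layer < numberLayers; layer++) {
    final float[] prevOutput = neuronOutput[layer - 1];
    final float[] currentOutput = neuronOutput[layer];
    final float[][] weights = fWeights[layer - 1];
    final int biasNeuron = neuronCounts[layer - 1];

    for (int neuron = 0, count = neuronCounts[layer]; neuron < count; neuron++) {
        final float[] weightsToNeuron = weights[neuron];

        // Weighted input from the bias neuron (its output is always 1.0)
        float activation = weightsToNeuron[biasNeuron];

        // One contiguous pass over the previous layer's outputs
        for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
            activation += prevOutput[inputNeuron] * weightsToNeuron[inputNeuron];
        }

        currentOutput[neuron] = sigmoid(activation);
    }
}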

Peter Lawrey
    Wow - optimizing array access reduced running time by 20 %. Thanks. – Martin Wiboe Jun 08 '10 at 15:25
  • swapping the last two indices of `fWeights` will also allow the `activation += neuronOutput[layer-1][inputNeuron] * fWeights[layer - 1][inputNeuron][neuron];` loop over inputNeuron to vectorize with SSE2 or AVX (and even FMA), if Java (or your specific JVM) has any kind of `-ffast-math` option. Turning strided accesses into contiguous is a huge advantage with SIMD. – Peter Cordes Nov 20 '16 at 03:55
5

For a start, don't do this:

// Copy input values to input layer output
for (int i = 0; i < inputNeurons; i++) {
    neuronOutput[0][i] = input[i];
}

But this:

System.arraycopy( input, 0, neuronOutput[0], 0, inputNeurons );
SyntaxT3rr0r
  • Sure, but aren't there only two arrays copied in this algorithm, one to copy the inputs in, and one to copy the results out? The real costs are more likely inside the nested for loops. – Jim Ferrans Jun 08 '10 at 00:27
  • @W and V - True, but that's not where the bottleneck would be. – CurtainDog Jun 08 '10 at 00:36
  • Good suggestion - will do that :) But the running time of the method is determined by the inner loops, so it won't save the day, unfortunately. (inputNeurons is ~ 200, so it shouldn't make that big a difference) – Martin Wiboe Jun 08 '10 at 00:50
5

Disregarding the actual math, the array indexing in Java can be a performance hog in itself. Consider that Java has no real multidimensional arrays, but rather implements them as arrays of arrays. In your innermost loop, you access multiple indices, some of which are in fact constant in that loop. Part of the array access can be moved outside of the loop:

final float[] neuronOutputSlice = neuronOutput[layer - 1];
final float[][] fWeightsSlice = fWeights[layer - 1];
for (int inputNeuron = 0; inputNeuron < biasNeuron; inputNeuron++) {
    activation += neuronOutputSlice[inputNeuron] * fWeightsSlice[inputNeuron][neuron];
}

It is possible that the server JIT performs a similar invariant code motion; the only way to find out is to change it and profile. On the client JIT this should improve performance no matter what. Another thing you can try is to precalculate the for-loop exit conditions, like this:

for (int neuron = 0; neuron < neuronCounts[layer]; neuron++) { ... }
// transform to precalculated exit condition (move invariant array access outside loop)
for (int neuron = 0, neuronCount = neuronCounts[layer]; neuron < neuronCount; neuron++) { ... }

Again, the JIT may already do this for you, so profile to see if it helps.

Is there a point to multiplying by 1.0F that eludes me here?

float activation = 1.0F * fWeights[layer - 1][biasNeuron][neuron];

Other things that could potentially improve speed at the cost of readability: inline the sigmoid() function manually (the JIT has a very tight limit for inlining and the function might be larger). It can be slightly faster to run a loop backwards (where it doesn't change the outcome, of course), since testing the loop index against zero is a little cheaper than checking against a local variable. The innermost loop is a potential candidate again, but don't expect the output to be 100% identical in all cases, since adding floats a + b + c is potentially not the same as a + c + b.
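For example, the inner loop from the snippet above rewritten to count down (a sketch only; the summation order changes, so the float result may differ in the last bits):

// Counting down so the loop test compares the index against zero.
for (int inputNeuron = biasNeuron - 1; inputNeuron >= 0; inputNeuron--) {
    activation += neuronOutputSlice[inputNeuron] * fWeightsSlice[inputNeuron][neuron];
}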

Durandal
3

Replace the expensive floating point sigmoid transfer function with an integer step transfer function.

The sigmoid transfer function is a model of organic analog synaptic learning, which in turn seems to be a model of a step function.

The historical precedent for this is that Hinton designed the back-prop algorithm directly from the first principles of cognitive science theories about real synapses, which in turn were based on real analog measurements, which turn out to be sigmoid.

But the sigmoid transfer function seems to be an organic model of the digital step function, which of course cannot be directly implemented organically.

Rather than model a model, replace the expensive floating point implementation of the organic sigmoid transfer function with the direct digital implementation of a step function (less than zero = -1, greater than zero = +1).
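A minimal sketch of such a step transfer function (a hypothetical drop-in for the sigmoid() in the question; note the output range changes from (0, 1) to {-1, +1}, so thresholds and training targets would need adjusting):

// Hard step: negative activations map to -1, non-negative ones to +1.
private static float step(final float input) {
    return input < 0.0F ? -1.0F : 1.0F;
}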

The brain cannot do this, but backprop can!

This not only linearly and drastically improves performance of a single learning iteration, it also reduces the number of learning iterations required to train the network: supporting evidence that learning is inherently digital.

Also supports the argument that Computer Science is inherently cool.

Dominic Cerisano
3

The first thing I would look into is whether Math.exp is slowing you down. See this post on a Math.exp approximation for a native alternative.

nivekastoreth
  • I was thinking a lookup table for the entire `sigmoid()` function might be worthwhile, but it's hard to say without knowing how much time is spent in that function. – Brendan Long Jun 08 '10 at 00:08
  • It's almost certain a lookup table would greatly increase the speed of that function, and possibly help you regain the 25% loss from C to Java. If you doubt how much time is spent there, use some profiling tools to determine what is taking so long. But since it's at least being calculated layer*neuron times, there's a good chance this is one bottleneck that can be easily eliminated. – drharris Jun 08 '10 at 00:13
  • I have tried using that approximation, but unfortunately the results are too inaccurate. Do you know any way to increase the accuracy by trading off some speed? @Brendan Long and drharris Lookup table could very well be an option - I will be doing millions of calculations. How would one go about implementing a thread-safe lookup table that uses floating-point numbers as the key? – Martin Wiboe Jun 08 '10 at 00:53
    Well, check this older post out, for many examples on how to improve the typical Sigmoid function: http://stackoverflow.com/questions/412019/math-optimization-in-c – drharris Jun 08 '10 at 01:14
  • If your question is "how do I make this code run faster", before going much deeper on optimizing the trade-off between speed and accuracy, I'm curious whether the inaccurate (yet faster) method pushed you closer to an acceptable run time? If not, perhaps we're looking in the wrong area. – nivekastoreth Jun 14 '10 at 18:37
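A minimal sketch of the lookup-table idea from these comments (purely illustrative; the SigmoidTable class, its constants and its clamping range are assumptions, and since the table is filled once and never modified afterwards, it is safe to share between threads):

// Hypothetical precomputed sigmoid table; names and constants are illustrative.
final class SigmoidTable {
    private static final int SIZE = 4096;      // table resolution
    private static final float MIN = -8.0F;    // sigmoid(-8) is roughly 0.0003
    private static final float MAX = 8.0F;     // sigmoid(+8) is roughly 0.9997
    private static final float SCALE = (SIZE - 1) / (MAX - MIN);
    private static final float[] TABLE = new float[SIZE];

    static {
        // Fill the table once; after this it is read-only.
        for (int i = 0; i < SIZE; i++) {
            float x = MIN + i / SCALE;
            TABLE[i] = (float) (1.0 / (1.0 + Math.exp(-x)));
        }
    }

    static float sigmoid(final float input) {
        if (input <= MIN) return TABLE[0];
        if (input >= MAX) return TABLE[SIZE - 1];
        return TABLE[(int) ((input - MIN) * SCALE)];
    }
}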
1

Purely based upon code inspection, your innermost loop has to compute references into a three-dimensional array, and it's being done a lot. Depending upon your array dimensions, you could be having cache issues due to having to jump around memory with each loop iteration. Maybe you could rearrange the dimensions so the inner loop accesses memory elements that are closer to one another than they are now?

In any case, profile your code before making any changes and see where the real bottleneck is.

sizzzzlerz
  • Profiling will definitely help. I'll try and switch the last two indices in fWeights[layer - 1][inputNeuron][neuron], so that inputNeuron, which varies, is the 3rd index. – Martin Wiboe Jun 08 '10 at 01:01
1

I suggest using a fixed-point system rather than a floating-point system. On almost all processors using int is faster than float. The simplest way to do this is to shift everything left by a certain amount (4 or 5 bits are good starting points) and treat those bottom bits as the fractional part.

Your innermost loop is doing floating-point math, so this may give you quite a boost.
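A minimal sketch of the fixed-point idea (illustrative only; the helper names are made up, 5 fractional bits are assumed, and the sigmoid would still need a fixed-point-friendly replacement such as a lookup table):

// Illustrative fixed-point helpers with 5 fractional bits.
static final int FRACTION_BITS = 5;

// Convert between float and the fixed-point representation.
static int toFixed(final float value) {
    return Math.round(value * (1 << FRACTION_BITS));
}

static float toFloat(final int fixed) {
    return fixed / (float) (1 << FRACTION_BITS);
}

// Multiply-accumulate: the raw product carries 10 fractional bits,
// so shift right by 5 to bring it back to the common format.
static int multiplyAccumulate(final int accumulator, final int a, final int b) {
    return accumulator + ((a * b) >> FRACTION_BITS);
}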

Daniel
  • In general, a good point (in fact many systems which actually do require fixed precision are wrong because they naively use FP). However in this case I don't think the sigmoid function lends itself well to this technique. – CurtainDog Jun 08 '10 at 00:35
  • With modern hardware, one FP instruction is faster than multiple integer instructions required to do the same thing in fixed-point. (especially for multiply, where you need a shift to put the point in the right place; add/sub are cheaper.) – Peter Cordes Nov 20 '16 at 03:46
  • Integer is great for processing pixels which were originally integer to start with, since often 16 bits per element is enough. So you can get twice as many elements per SIMD vector vs. with float, and there are some SSE instructions specifically designed for the stuff that's often needed on pixels. So working with integers is useful when you have multiple 16-bit elements to do in parallel, esp. if it saves converting to/from float. For other cases, it's often not worth it. – Peter Cordes Nov 20 '16 at 03:47
  • The latest CPUs even support vector FP fused-multiply-add with one instruction, but not integer. So you can actually get more done per clock with FP on the latest Intel/AMD CPUs. (FMA instructions were introduced after this answer was written, but they might be useful to a JVM running that sum loop.) – Peter Cordes Nov 20 '16 at 03:51
0

The key to optimization is to first measure where the time is spent. Surround various parts of your algorithm with calls to System.nanoTime():

long start_time = System.nanoTime();
doStuff();
long time_taken = System.nanoTime() - start_time;

I'd guess that while using System.arraycopy() would help a bit, you'll find your real costs in the inner loop.

Depending on what you find, you might consider replacing the float arithmetic with integer arithmetic.
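For example, a rough timing harness (a sketch; the benchmark method and iteration counts are arbitrary) that warms up first so the JIT has compiled compute() before measuring:

// Rough timing sketch: warm up first, then measure the steady-state cost per call.
static void benchmark(final FeedForwardNetwork network, final float[] input) {
    for (int i = 0; i < 10000; i++) {
        network.compute(input);                // warm-up, result discarded
    }

    final int runs = 50000;
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
        network.compute(input);
    }
    long elapsed = System.nanoTime() - start;
    System.out.println("Average per call: " + (elapsed / runs) + " ns");
}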

Jim Ferrans