
I am the author of an open source scientific code called vampire ( http://github.com/richard-evans/vampire ), and because it is compute-intensive, any improvement in code performance can significantly increase the amount of research that can be done. Typical runtimes of this code are hundreds of core-hours, so I am always looking for ways to improve the performance-critical sections. However, I have become stuck on the following relatively innocuous-looking piece of code, which makes up around 40% of the runtime:

for (int atom = start_index; atom < end_index; atom++){
    register double Hx = 0.0; // accumulated exchange field for this atom
    register double Hy = 0.0;
    register double Hz = 0.0;
    // range of this atom's entries in the flattened neighbour list
    const int start = atoms::neighbour_list_start_index[atom];
    const int end = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++){
        const int natom = atoms::neighbour_list_array[nn]; // index of the neighbouring atom
        // exchange constant for this interaction, looked up by interaction type
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        Hx -= Jij * atoms::x_spin_array[natom];
        Hy -= Jij * atoms::y_spin_array[natom];
        Hz -= Jij * atoms::z_spin_array[natom];
    }
    atoms::x_total_spin_field_array[atom] += Hx;
    atoms::y_total_spin_field_array[atom] += Hy;
    atoms::z_total_spin_field_array[atom] += Hz;
}

A high-level overview of the function and variables in this code is as follows: there is a 1D array of a physical vector called 'spin' (split into three 1D arrays, one per component x, y, z, for memory caching purposes: atoms::x_spin_array, etc.). Each of these spins interacts with some other spins, and all the interactions are stored as a 1D neighbour list (atoms::neighbour_list_array). The relevant range of interactions for each atom is given by start and end indices into the neighbour list array, held in two separate arrays. At the end of the calculation, each atomic spin has an effective field, which is the vector sum of its interactions.
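
All of these containers are std::vectors of plain ints and doubles - no pointers anywhere. Roughly, the declarations look like the sketch below (the exchange struct name is a placeholder, not the real one from the code):

#include <vector>

namespace atoms{
    // spin components, one contiguous array per component (structure-of-arrays)
    std::vector<double> x_spin_array, y_spin_array, z_spin_array;
    // accumulated effective field, same layout
    std::vector<double> x_total_spin_field_array, y_total_spin_field_array, z_total_spin_field_array;
    // atom i's neighbours are neighbour_list_array[nn] for
    // nn = neighbour_list_start_index[i] .. neighbour_list_end_index[i] (inclusive)
    std::vector<int> neighbour_list_start_index, neighbour_list_end_index, neighbour_list_array;
    // interaction type of each neighbour-list entry, used to look up Jij
    std::vector<int> neighbour_interaction_type_array;
    struct exchange_t { double Jij; }; // placeholder name; only Jij is used here
    std::vector<exchange_t> i_exchange_list;
}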

Given the small amount of code and the sizable fraction of the runtime it occupies, I have done my best, but I feel there must be a way to optimize this further. As a physicist rather than a computer scientist, maybe I am missing something?

rfle500
  • What if you replace linked lists with vectors or arrays? Since they require fewer pointer traversals, and they're also more compact, they can be more cache-friendly, and consequently faster. –  Jul 22 '13 at 18:53
  • Make sure you have unit tests in place so that when you do come up with an optimized version of the code, it will still give you the same results (especially important if you want your research results to be valid)! – In silico Jul 22 '13 at 18:54
  • @rfle500: What are the types of `neighbour_list_start_index` and the other things used here? Are those vectors? – Mooing Duck Jul 22 '13 at 18:56
  • @H2CO3 Ahh sorry - they are all std::vectors - it's a linked list only in concept - it is all unrolled to 1D vectors. One thing I did consider is whether it's worth unrolling the spin array as well, at the cost of more memory use. – rfle500 Jul 22 '13 at 18:57
  • What confidence do you have that you actually can go faster? At least get a decent profiler that shows you the L1 D-cache miss rate. If it is high then you'll have a shot at it. – Hans Passant Jul 22 '13 at 18:58
  • @MooingDuck they are just ints, doubles etc, no pointers anywhere. – rfle500 Jul 22 '13 at 18:58
  • `register` is deprecated and has no effect. – Konrad Rudolph Jul 22 '13 at 18:59
  • @rfle500 Maybe C++ guys will kill me for this, but what if you replace the vectors with raw *stack* (i.e. automatic) arrays? Vectors allocate memory dynamically, and that can be a problem as well. –  Jul 22 '13 at 18:59
  • @H2CO3: heap vs stack shouldn't make any difference in this code. – Mooing Duck Jul 22 '13 at 19:01
  • @H2CO3: I just benchmarked another question, and for a small number of items (so cache is still useful), vector vs. array gave at most 10%. So that's 4% of the total runtime, in this case. – Mats Petersson Jul 22 '13 at 19:02
  • @MatsPetersson Then that's really not significant. –  Jul 22 '13 at 19:03
  • Perhaps you can do something about those interaction types - deal with each of them separately. You could then remove two array accesses and three multiplications from the inner loop. – zch Jul 22 '13 at 19:05
  • @MatsPetersson: Were you using gcc and -O3 on those tests? I've done benchmarks on vectors ranging in size from 100s to millions, and vector and array perform identically (within system noise < 2%) unless compiling with LLVM, which seemed to mis-optimize. – MobA11y Jul 22 '13 at 19:05
  • @H2CO3: Indeed. But it really depends on cache-locality, cache-hit rate, and lots of other things. – Mats Petersson Jul 22 '13 at 19:05
  • @rfle500: neighbour_list_start_index, neighbour_list_end_index, neighbour_list_array, neighbour_interaction_type_array, i_exchange_list, x_spin_array, y_spin_array, z_spin_array, and x_total_spin_field_array are all vectors? – Mooing Duck Jul 22 '13 at 19:05
  • Also, is it viable to change `x_spin_array[natom]` into `spin_array[natom].x`? (And same with x_total_spin_field_array). That should give better cache performance. – Mooing Duck Jul 22 '13 at 19:06
  • @OP: have you considered using threading in this case? If these data structures are large you may stand to gain considerably by launching a couple of worker threads to accumulate your results, and joining them. Odds/evens, or another modulus, can be a good approach for indexing... Might I recommend your modulus be the number of computing cores you're working with... – MobA11y Jul 22 '13 at 19:07
  • @ChrisCM: It was g++ -O3, yes. It was a question about SSE optimisation from earlier today. I had a vector of vectors of int vs. an int [100000][128], and only measured the actual calculation stage (so no memory allocation). 10% was the best I could get, sometimes it was a bit less than that. – Mats Petersson Jul 22 '13 at 19:07
  • @MooingDuck - Yes, they are all vectors. Packing the data is possible, but as far as I was aware the jury was still out on whether this is generally faster, if I understand cache lines correctly? At least in an older code, packing the data was marginally slower, though I haven't tried it with this one yet... – rfle500 Jul 22 '13 at 19:09
  • @Mats: interesting, I'm really into benchmarking and backend performance stuff. Would you mind directing me to the post, or starting a separate discussion about this? – MobA11y Jul 22 '13 at 19:10
  • @ChrisCM In fact the code is massively parallel using the MPI library, with 1 thread/core which scales very well, so I am indeed used to throwing many cores at the problem. But here I see it as a fundamentally serial problem. – rfle500 Jul 22 '13 at 19:12
  • @zch The interaction types are often just a single number, but not always. The problem in general requires that each interaction can be unique, though that is not always the case. – rfle500 Jul 22 '13 at 19:14
  • Are there many repetitions in the data? For example, would it make sense to cache the terms you are subtracting for a given `nn` instead of performing the lookup and multiplication each time? – Omri Barel Jul 22 '13 at 19:14
  • I agree with @HansPassant. Profile for cache misses. If your code runs on linux, use cachegrind: http://valgrind.org/docs/manual/cg-manual.html to see if it reports anything valuable. – JimR Jul 22 '13 at 19:16
  • @OmriBarel Generally each spin behaves independently and the spin coordinates are different every time, and so the calculated fields are also different. Also the neighbour list array is quite big so probably gets flushed from the cache each time, as there is a lot of other stuff going on between calls. – rfle500 Jul 22 '13 at 19:18
  • @JimR and HansPassant - thanks - I will look into cachegrind and report back. – rfle500 Jul 22 '13 at 19:19
  • @rfle500 my point was that essentially you are doing this: `Hx -= atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij * atoms::x_spin_array[atoms::neighbour_list_array[nn]]` for each `nn` between `start` and `end`. If `start` and `end` repeat themselves, then it's better to store the result of this monster for `nn` (it only depends on `nn`) than to calculate it each time. – Omri Barel Jul 22 '13 at 19:20
  • @ChrisCM: This one - sorry, I missed out the link earlier: http://stackoverflow.com/questions/17791892/c-use-sse-instructions-for-comparing-huge-vectors-of-ints. My benchmark can be found here: http://pastebin.com/esh0mnD6 (you'll need to comment out the `GetDiff2` part if you do #define VECTOR 0). – Mats Petersson Jul 22 '13 at 19:20
  • @OmriBarel Even if atoms::i_exchange_list is a single number? – rfle500 Jul 22 '13 at 19:23
  • My suggestion was that you could try to reorganize the code so that you don't need to recalculate `Jij` inside the tightest loop. Then you could multiply after the summing is complete (for each interaction type). – zch Jul 22 '13 at 19:23
  • It looks like you can split it into parallel X, Y and Z parts. – n. m. could be an AI Jul 22 '13 at 19:25
  • @zch Ahh I get it now - that's an interesting suggestion. In general the number of interaction types << number of atoms, so that would definitely work, and might be a lot faster (see the sketch after these comments). – rfle500 Jul 22 '13 at 19:25
  • There is still a double lookup + fixed offset just to get `Jij` from `nn` (two arrays + struct/class). Then for `x_spin_array`, again two lookups. And then there's a multiplication. But the result only depends on `nn`, so I would definitely try to get a single `x_result[nn]` instead of the full expression (outside the inner loop, of course). – Omri Barel Jul 22 '13 at 19:27
  • @n.m. Yes I could, though since the neighbour list array is large, I thought that this might be less efficient due to larger numbers of reads from memory? – rfle500 Jul 22 '13 at 19:27
  • @OmriBarel Ok I see - that makes sense. I will definitely try unrolling the interactions and see if that helps. Thanks. – rfle500 Jul 22 '13 at 19:32
  • A place for Duff's Device??? – Jiminion Jul 22 '13 at 19:44
  • Well, I'm going to suggest it despite it not really solving the problem: each of the calculations here is independent, so multi-threading may help (this is just a 2D reduction, after all). You could also go down the GPU processing route for a 10x-100x speedup (maybe - there are a lot of caveats). – IdeaHat Jul 22 '13 at 20:09
  • @MadScienceDreams From the comments so far, I think this code is already highly parallelized at a higher level, so introducing additional multi-threading here is just going to produce overhead. – Sebastian Redl Jul 23 '13 at 10:10
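
A sketch of zch's suggestion from the comments above: accumulate a partial spin sum per interaction type, then multiply by Jij once per type instead of once per neighbour. The helper arrays are hypothetical, and this assumes the interaction types index i_exchange_list directly and that the number of types is small:

#include <algorithm>
#include <vector>

// per-type partial sums, reused across atoms
const int num_types = static_cast<int>(atoms::i_exchange_list.size());
std::vector<double> sx(num_types), sy(num_types), sz(num_types);

for (int atom = start_index; atom < end_index; atom++){
    std::fill(sx.begin(), sx.end(), 0.0);
    std::fill(sy.begin(), sy.end(), 0.0);
    std::fill(sz.begin(), sz.end(), 0.0);
    const int start = atoms::neighbour_list_start_index[atom];
    const int end = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++){
        const int natom = atoms::neighbour_list_array[nn];
        const int type = atoms::neighbour_interaction_type_array[nn];
        sx[type] += atoms::x_spin_array[natom]; // no multiply in the tight loop
        sy[type] += atoms::y_spin_array[natom];
        sz[type] += atoms::z_spin_array[natom];
    }
    double Hx = 0.0, Hy = 0.0, Hz = 0.0;
    for (int t = 0; t < num_types; t++){ // one multiply per type, not per neighbour
        const double Jij = atoms::i_exchange_list[t].Jij;
        Hx -= Jij * sx[t];
        Hy -= Jij * sy[t];
        Hz -= Jij * sz[t];
    }
    atoms::x_total_spin_field_array[atom] += Hx;
    atoms::y_total_spin_field_array[atom] += Hy;
    atoms::z_total_spin_field_array[atom] += Hz;
}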

3 Answers


You've got a constant stream of multiplies, subtracts and adds on contiguous data. That seems like an ideal use case for SSE. If it's memory-bandwidth limited, consider OpenCL/CUDA instead.

Try using this library if you aren't familiar with all the low-level instructions.

The inner loop could then be restructured substantially, potentially leading to speed-ups.
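
For illustration, here is a minimal SSE2 sketch of the question's loop using raw intrinsics rather than the library mentioned above. The x and y components are packed into one register; the indexed neighbour loads remain scalar gathers, so the gain may be modest:

#include <emmintrin.h> // SSE2 intrinsics

for (int atom = start_index; atom < end_index; atom++){
    __m128d Hxy = _mm_setzero_pd(); // packed { Hx, Hy }
    double Hz = 0.0;
    const int start = atoms::neighbour_list_start_index[atom];
    const int end = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++){
        const int natom = atoms::neighbour_list_array[nn];
        const double J = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        const __m128d Jij = _mm_set1_pd(J); // broadcast Jij to both lanes
        const __m128d Sxy = _mm_set_pd(atoms::y_spin_array[natom],  // high lane: y
                                       atoms::x_spin_array[natom]); // low lane:  x
        Hxy = _mm_sub_pd(Hxy, _mm_mul_pd(Jij, Sxy)); // Hx -= J*x, Hy -= J*y
        Hz -= J * atoms::z_spin_array[natom];        // z stays scalar
    }
    double hxy[2];
    _mm_storeu_pd(hxy, Hxy);
    atoms::x_total_spin_field_array[atom] += hxy[0];
    atoms::y_total_spin_field_array[atom] += hxy[1];
    atoms::z_total_spin_field_array[atom] += Hz;
}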

Delta_Fore
  • +1, more if I could. @rfle500 Actually... writing the type of code you're writing should require reading everything published on Agner Fog's website: www.agner.org - I totally forgot to reference it. – JimR Jul 24 '13 at 07:37

If the x, y, z components are indeed linked lists, doing x[i], y[i] and z[i] will cause the lists to be traversed on every access, giving n²/2 traversal steps in total. Using vectors makes each access an O(1) operation.

You mention that the three coordinates are split out for memory caching purposes, but this hurts your Level 1 and Level 2 cache locality, as you are accessing three different areas of memory. A linked list would also hurt your cache locality.

Using something like:

struct vector3d {
    double x;
    double y;
    double z;
};

std::vector<vector3d> spin;
std::vector<vector3d> total_spin;

This should improve the cache locality, as the x, y and z values are adjacent in memory and the spins occupy a linear block of memory.
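
The question's loop would then read something like this - a sketch using the `spin` and `total_spin` vectors above together with the question's neighbour-list arrays:

for (int atom = start_index; atom < end_index; atom++){
    vector3d H = {0.0, 0.0, 0.0};
    const int start = atoms::neighbour_list_start_index[atom];
    const int end = atoms::neighbour_list_end_index[atom] + 1;
    for (int nn = start; nn < end; nn++){
        // one lookup brings all three components onto the same cache line
        const vector3d& s = spin[atoms::neighbour_list_array[nn]];
        const double Jij = atoms::i_exchange_list[atoms::neighbour_interaction_type_array[nn]].Jij;
        H.x -= Jij * s.x;
        H.y -= Jij * s.y;
        H.z -= Jij * s.z;
    }
    total_spin[atom].x += H.x;
    total_spin[atom].y += H.y;
    total_spin[atom].z += H.z;
}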

reece

I feel the following suggestions can help you optimize the code a bit, if not completely:

  1. Use initialization rather than assignment wherever possible.
  2. Prefer pre-increment over post-increment for better speed (believe me, it does make a difference) - see the sketch below.
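
A minimal illustration of both points (make_value is a hypothetical function; note that for built-in types such as the doubles in the question, neither change makes any difference):

#include <string>
#include <vector>

std::string make_value(){ return "example"; }

int main(){
    // Point 1: initialization constructs the object in place;
    // the assignment form default-constructs first, then assigns.
    std::string a = make_value(); // initialization: one construction
    std::string b;                // default construction...
    b = make_value();             // ...followed by an assignment

    // Point 2: for class-type iterators, ++it avoids the temporary
    // copy that it++ must return.
    std::vector<double> v(4, 1.0);
    double sum = 0.0;
    for (std::vector<double>::iterator it = v.begin(); it != v.end(); ++it)
        sum += *it;

    return sum == 4.0 ? 0 : 1;
}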

Apart from that, I think the code is just fine. Every data structure has its pros and cons; you have to live with them.

Happy Coding!

Mudassir Razvi