
I am doing speed optimization on my own code. Pseudocode is something like this:

1. myStructure = constructStructure (param1)
2. processStructure(myStructure)
3. myStructure = constructStructure (param2)
4. processStructure(myStructure)

I was focusing on optimizing the function constructStructure(param), since the two calls to it take up about 75% of the total time. I did not touch the function processStructure(structure).

I did get a nice speedup (the whole thing now takes about 75% of the original time), but when I measured the respective times of operations 1.-4., I got unexpected results:

      before        after
1.   75.10ms      56.88ms
2.   23.12ms      19.32ms
3.   70.72ms      53.22ms
4.   20.81ms      14.45ms

I got a (small, but) significant speedup in parts 2. and 4., which were not changed! This was measured over 1000 runs and then averaged. (I did not calculate the standard deviation, but I did run and display individual times 20 or so times for each variant, and they seem consistent with the averages.)

The structures produced before and after optimization are identical as far as I can tell, and the final output of the program is the same in both cases.

There are no (substantial) memory leaks as far as I can tell -- I was monitoring my system's memory during the test runs and there was no consistent increase in used memory. (I know this is not a very good way to test for that, and digging into potential memory leakage is my next stop. Plus, I do make a point of freeing/deleting all reserved memory as soon as I don't need it.)

Not that I'm unhappy about the speedup, but I can't even begin to understand what happened! As this is code I will be using for quite a while after I finish working on it, having a mystery in it is not an appealing option.

Does anybody have any ideas as to what happened, and how I can verify whether a suggested explanation is really the case? P.S. I'm working in C++.


As the actual code is too big to post here (about 100 lines for producing the structure plus 300 lines for processing it), I can describe it a little, as well as describing the change:

constructStructure

It is a function that builds a tree structure (not a binary tree) from grayscale image pixels, where every node is assigned an int attribute.

The parameter to the constructStructure function is a Comparator indicating the ordering of pixel intensities (less<int>() the first time, greater<int>() the second time).

The only change in the optimized version is using std::priority_queue instead of std::multimap to implement a heap structure (as described in this question of mine -- except that it is used not with std::pair<int,int> but with std::pair<int,myData>).

I have verified that the produced myStructure is an equivalent tree (by examining the produced trees for 10 different images), in the sense that:

  • the produced tree has the same number of nodes
  • the data contained in the nodes is the same
  • but the order of children within a node is different when using std::multimap than when using std::priority_queue (the children are again the nodes holding the same data)
  • conclusion: the trees are equivalent in the data they hold and in their structure, up to the order of the child nodes within any single parent node

processStructure

It is a function that examines the constructed tree in a DFS (bottom-up) fashion. Its complexity depends only on the number of nodes in the tree, and it examines only the attribute assigned to each node (not the data contained in the node, which is available for further processing but not used here).


Finally, after further testing, and given the difference in node order I pointed out, the question becomes: is it possible that the order of the nodes in the tree produces this significant change in performance, even if (when traversing the tree with a DFS approach) the data within the nodes is not examined, but only a single integer attribute for each node?

penelope
    I'm afraid, there's no way of analyzing this without seeing the actual code. – Vince Sep 20 '13 at 14:27
    have you *verified*, that you haven't altered behaviour? is `myStructure` same after optimization? – Karoly Horvath Sep 20 '13 at 14:30
  • Could be anything. Maybe a cache effect. – Fred Foo Sep 20 '13 at 14:30
  • my guess (without seeing the code) is that you have actually changed the **content** of `myStructure` which is causing the `processStructure` to take a different code path. I would start with comparing the original `myStructure` against the new `myStructure`... – Andrew Bickerton Sep 20 '13 at 14:34
  • possibly the size of the object has changed – njzk2 Sep 20 '13 at 14:34
  • @KarolyHorvath I have checked. I'm producing a giant tree (~5000 nodes), so I could not print it out. But, before and after the optimization, it contained the same number of nodes. The only optimization I introduced during the construction process is substituting `std::multimap` with `std::priority_queue` as described [in this question](http://stackoverflow.com/q/18913613/884412) – penelope Sep 20 '13 at 14:43
  • Unfortunately, I can not share the code -- the construction function is some _100_ lines of code, and the process function(s) are some _300_ lines of code. I can describe the process in a few lines (will edit the question) – penelope Sep 20 '13 at 14:46
  • Are you on Linux? Can you run Valgrind's Cachegrind tool? – ninjalj Sep 20 '13 at 14:59
  • I think the thing I would focus on is the relative performance of `multimap` vs `priority_queue` for the various operations you use (I'm assuming your construct phase does all the insertions, etc, and the processing phase just visits each node, but without seeing code, that's not necessarily a sane assumption). – twalberg Sep 20 '13 at 15:02
  • @twalberg There's no `multimap`/`priority_queue` used in processing. The said structure is used only as a _helper heap_ while constructing the tree. While processing, the tree has a standard `Node` with an array-of-children structure – penelope Sep 20 '13 at 15:11
  • @ninjalj I am, I can, but how would it help? 1st if I run it on Release (optimized) version, I can not even see the function names (and the optimization does make a significant difference)... and 2nd -- I did, it just tells me the same thing, that the function is faster while it shouldn't be (e.g. the relative speed of the functions is the same) – penelope Sep 20 '13 at 15:13
  • Ah, perhaps I mis-read that bit, then. In that case, I might guess that `Nodes` might be being allocated in a different order that happens to provide better locality-of-reference or other cache-friendly access patterns. – twalberg Sep 20 '13 at 15:14
  • @twalberg Can you please expand your last comment, and perhaps put it in to an answer? It sounds plausible in my case, because (as I just edited) the content of the `Node`s is the same, but the order is not. Also, if you have a suggestion on how to potentially verify this theory and add that in to the answer I would be very happy to try it out. – penelope Sep 20 '13 at 19:25
  • I know that my question does not include any program code -- but can somebody please suggest how to fix the question to be more suitable? The two functions in question are 100 and 300 lines of code, and even then it is not standalone because I use a lot of code for implementing Nodes and Trees that are used in the functions. – penelope Sep 23 '13 at 08:51

2 Answers


What is possible:

  1. Inlining: Maybe processStructure() and constructStructure() are located in an include file, or in the same file where you call those functions. If so, the compiler can inline the code itself to save on the procedure call. So this would just be an effect of macro-optimization across the combined function calls.

  2. Cache effect: Maybe you use less code, or less memory, and after the optimization something now fits into the L1/L2 cache.

I suggest you generate assembly files for both program variants (the -S option for gcc) and compare the assembler code.

olegarch
  • Both functions are very big (100 and 300 lines of code) so inlining is probably not the case. Your second suggestion sounds interesting: can you please expand on how to verify it? It is all part of a big project (personal library in making) so examining assembly files will be very hard: however, if you expand on how it will help and what I might be looking for to verify the suggestion, I will gladly try it out and report my success back. Thank you for the suggestions – penelope Sep 20 '13 at 19:31
  • Figuring out a caching effect is a difficult and unstable task, because it is a side effect of the hardware configuration. But one possibility: try inserting some heavy function between the calls to construct() and process(), and compare with the time when the heavy function is not between them but aside (before or after). Also, if you want, I can help you optimize your code... – olegarch Sep 20 '13 at 19:41
  • well, it's really late here and I'm not working any more tonite... thanks for the offer to help with optimization, but it's a 10+ source files in the project, with all entwined and I don't know how you could help (suggestions welcome thou). But, it's not my first time optimizing, I used valgrinds cachegrind to identify the slowest parts of code and that's the one discussed here. And the results is becoming acceptable for what I need. Will try the _heavyFunction_ tomorrow -- can you expand your answer with the suggestion maybe? – penelope Sep 20 '13 at 20:11
  • HeavyFunction() would just overwrite the cache, and in this way decrease the "caching effect". But I read above that you use dynamic data structures from the STL (multimap, priority_queue) -- these are complex structures that can be spread over many memory pages. So, to clarify and narrow down the source of the effect, you need to minimize or disable dynamic memory usage during your procedures. Of course, the program will not work properly, but this is just temporary, for investigation only. – olegarch Sep 20 '13 at 20:52
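The "heavy function" suggested in these comments could be sketched as follows: a function that walks a buffer assumed to be larger than the last-level cache, evicting whatever constructStructure() left cached before processStructure() runs (the 64 MiB size is a guess and would need tuning to the actual machine):

```cpp
#include <cstddef>
#include <vector>

volatile long sink = 0;  // defeats dead-code elimination

// Touch one byte per cache line across a buffer assumed to be larger
// than the last-level cache, so previously cached data is evicted.
void flushCaches() {
    static std::vector<char> buf(64 * 1024 * 1024);  // 64 MiB, assumed > LLC
    long acc = 0;
    for (std::size_t i = 0; i < buf.size(); i += 64)  // one touch per line
        acc += ++buf[i];
    sink = acc;
}
```

Timing processStructure() with and without a flushCaches() call in between the two phases should reveal whether the speedup depends on state left in the cache by constructStructure().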

This is a bit of a guess, since we don't have code to look at, and it would take some careful study to verify, even with the code, and even then, it might perform completely differently on a different system. But, caching and locality of reference and virtual memory translations can have the sort of significant performance impact seen here.

Take a careful look at the order in which your processing phase visits nodes (i.e. in-order, pre-order, post-order, random, whatever). Then consider where each of those nodes might have been allocated with respect to the other nodes that will be processed in sequence. If, after visiting one node, the next one happens to be very nearby in memory (especially within the same cache line, but maybe also within the same virtual memory page, because of limited TLB resources), it's likely to be quicker to access than a node that is, from the cache's standpoint, in a rather random location. Caches do things like prefetching, which means that if memory is accessed in a primarily linear manner, the cache can hide a lot of the latency of accessing main memory. The worst case would be if each node is in a completely different cache line, in a location with no discernible relation to the previous node. There's a lot more that can be said about the cache hierarchy and virtual memory performance - entire books have been written on the subject.

I would guess that the two different methods you use for building your tree result in significantly different orders of allocating the nodes in the tree, with the result that traversing the tree later has a completely different pattern of memory access that can cause noticeable performance changes. I don't know exactly how you would do it, but if you could arrange for all your nodes to be contiguous in memory, in the order that you will access them during your processing, that would be nearly the best possible arrangement, and you might see even more speedup than you already are. Usually, you have to settle for "good enough", though, especially if different inputs result in a significantly different data set.

valgrind has a cachegrind module that can help simulate an approximation of how a certain cache hierarchy will work with your program - it's useful to find some of these types of differences, although it should not be taken as a serious performance guarantee, because it's a necessarily simpler model of a cache than the real thing, and can't account for multi-tasking, kernel-user context switches, etc.

twalberg