4

There is a loop in my code that iterates 100 million times (required for 100 million replications of a simulation model). For each of the 100 million iterations, I retrieve a value from an array (myarray) by indexing on an integer variable named age. Due to the length of the array, it is only valid to index myarray[age] for age=0,...,99. However, the actual domain of age is 0,...,inf.

So, I have the following function

int tidx(const int& a) {
    return std::min(a,99);
}

which allows indexing by myarray[tidx(age)].

How could I do this more efficiently?


An example of building a source file that illustrates the compiler flags I'm using:

Building file: ../SAR.cpp
Invoking: GCC C++ Compiler
g++ -O3 -Wall -c -fmessage-length=0 -Wno-sign-compare -fopenmp -MMD -MP -MF"SAR.d" -MT"SAR.d" -o"SAR.o" "../SAR.cpp"
Finished building: ../SAR.cpp

From perf record followed by perf report:

Samples: 280  of event 'cycles', Event count (approx.): 179855989                                                                                   
 24.78%  pc2  libc-2.17.so         [.] __GI_____strtod_l_internal
 11.35%  pc2  pc2                  [.] samplePSA(int, double, int, NRRan&)
  6.67%  pc2  libc-2.17.so         [.] str_to_mpn.isra.0
  6.15%  pc2  pc2                  [.] simulate4_NEJMdisutilities(Policy&, bool)
  5.68%  pc2  pc2                  [.] (anonymous namespace)::stateTransition(double const&, int const&, int&, double const&, bool const&, bool&, bo
  5.25%  pc2  pc2                  [.] HistogramAges::add(double const&)
  3.73%  pc2  libstdc++.so.6.0.17  [.] std::istream::getline(char*, long, char)
  3.02%  pc2  libstdc++.so.6.0.17  [.] std::basic_istream<char, std::char_traits<char> >& std::operator>><char, std::char_traits<char> >(std::basic_
  2.49%  pc2  [kernel.kallsyms]    [k] 0xffffffff81043e6a
  2.29%  pc2  libc-2.17.so         [.] __strlen_sse2
  2.00%  pc2  libc-2.17.so         [.] __mpn_lshift
  1.72%  pc2  libstdc++.so.6.0.17  [.] __cxxabiv1::__vmi_class_type_info::__do_dyncast(long, __cxxabiv1::__class_type_info::__sub_kind, __cxxabiv1::
  1.71%  pc2  libc-2.17.so         [.] __memcpy_ssse3_back
  1.67%  pc2  libstdc++.so.6.0.17  [.] std::locale::~locale()
  1.65%  pc2  libc-2.17.so         [.] __mpn_construct_double
  1.38%  pc2  libc-2.17.so         [.] memchr
  1.29%  pc2  pc2                  [.] (anonymous namespace)::readTransitionMatrix(double*, std::string)
  1.27%  pc2  libstdc++.so.6.0.17  [.] std::string::_M_mutate(unsigned long, unsigned long, unsigned long)
  1.15%  pc2  libc-2.17.so         [.] round_and_return
  1.02%  pc2  libc-2.17.so         [.] __mpn_mul
  1.01%  pc2  libstdc++.so.6.0.17  [.] std::istream::sentry::sentry(std::istream&, bool)
  1.00%  pc2  libc-2.17.so         [.] __memcpy_sse2
  0.85%  pc2  libstdc++.so.6.0.17  [.] std::locale::locale(std::locale const&)
  0.85%  pc2  libstdc++.so.6.0.17  [.] std::string::_M_replace_safe(unsigned long, unsigned long, char const*, unsigned long)
  0.83%  pc2  libstdc++.so.6.0.17  [.] std::locale::locale()
  0.73%  pc2  libc-2.17.so         [.] __mpn_mul_1

From perf stat:

 Performance counter stats for './release/pc2':

         62.449034 task-clock                #    0.988 CPUs utilized          
                49 context-switches          #    0.785 K/sec                  
                 3 cpu-migrations            #    0.048 K/sec                  
               861 page-faults               #    0.014 M/sec                  
       179,240,478 cycles                    #    2.870 GHz                    
        58,909,298 stalled-cycles-frontend   #   32.87% frontend cycles idle   
   <not supported> stalled-cycles-backend  
       320,437,960 instructions              #    1.79  insns per cycle        
                                             #    0.18  stalled cycles per insn
        70,932,710 branches                  # 1135.850 M/sec                  
           697,468 branch-misses             #    0.98% of all branches        

       0.063228446 seconds time elapsed

I would appreciate any comments. I need to learn how to interpret/read this information, so any tips that might help me get started would be appreciated.

synaptik
  • I don't think you could. `std::min` will most likely be inlined so there is no real function call. The only way to make it more "efficient" would be to mark your function with `inline` as well and hope the compiler inlines it. – Some programmer dude May 24 '13 at 17:55
  • Have you looked at the assembly it generates? It might already be very efficient. – Mysticial May 24 '13 at 17:55
  • @Mysticial I don't know assembly, so I've never tried. – synaptik May 24 '13 at 17:56
  • Is passing an `int` by reference more optimal than passing by value? – andre May 24 '13 at 17:57
  • @andre Wouldn't passing by reference be faster because it requires 1 fewer local variable? – synaptik May 24 '13 at 17:58
  • @andre: No, for small primitives passing by value is faster. But after optimization there's likely to be no difference anyway. – Ben Voigt May 24 '13 at 17:58
  • @synaptik No, because the reference itself is a variable. – Mysticial May 24 '13 at 17:58
  • @Mysticial: At least until the optimizer gets ahold of it. – Ben Voigt May 24 '13 at 17:59
  • If performance matters at this point why not just use `myarray[std::min(age,99)];`. – andre May 24 '13 at 18:03
  • @andre Just because there are multiple locations where the array is accessed. But, it could be done, of course. – synaptik May 24 '13 at 18:04
  • @andre Assuming the definition (rather than just a declaration) of `tidx` is visible in the translation unit where it's used, there is no need to inline manually. The compiler will be able to. – May 24 '13 at 18:13
  • Yes, definition is visible in same translation unit as calling function. – synaptik May 24 '13 at 18:13
  • You might also look into whether you can call min less often in some way. Without seeing your code, obviously there's no telling whether that's even possible. – Joel May 24 '13 at 18:19
  • Is your results above from optimized code - what are the compiler options for this build? – Mats Petersson May 24 '13 at 21:30
  • Let's assume that you do this simulation 100 million times (i.e. 100 000 000) and each one takes 5 seconds. That is 500 million seconds, aka 5787 days (also known as 15 years). Even if a simulation took 1 second, that would be over 3 years. I think the function is less of the problem. – Ed Heal May 25 '13 at 03:11
  • It takes about a minute to run 100 million iterations of the loop. The problem is that this program is a component of a larger algorithm. Thus, the faster this part can be made to run, the better the performance of the larger algorithm. – synaptik May 25 '13 at 04:51
  • 100 000 000 (iterations)/60 (seconds) = (iterations per second) = 1666666 = (iterations per hour) = 27777 (iterations per hour) = 1157 per day = 3 per year. Please do the mathematics (and please let somebody else check the numbers!) – Ed Heal May 25 '13 at 04:59
  • Also are you trying to suggest that you can manage about 1.5e6 simulations per seconds - when is this data getting analysed? – Ed Heal May 25 '13 at 05:08
  • It might take a couple of minutes, but not long. It's a very simple simulation. Each iteration of the loop basically consists of generating 30 or so random variates, and summing 30 or so doubles. – synaptik May 25 '13 at 05:28

5 Answers

11

In order to optimize the code, one must first figure out which part is the bottleneck. To find the bottleneck, the code must be profiled. Otherwise, chances are that a lot of time will simply be wasted on micro-optimizations and false optimizations that don't matter at all.

I haven't used a profiler on your code (you haven't provided a minimal working example), but based on my experience I can tell you this: your tidx() function is not the bottleneck, and you should not be concerned with the performance of std::min() in this case. It is far more likely that the bottleneck is memory access and stalled CPU cycles.

For starters, try unrolling your loop if possible (and if the compiler hasn't already done it for you). It might be more efficient to perform 25,000,000 iterations rather than 100,000,000, but do more work in each loop cycle. Before you do that, though, you must make sure that unrolling helps rather than hurts. That is usually done by profiling, so we get back to the point that in order to optimize the code one must first figure out which part is the bottleneck. To find a bottleneck... Oh, wait, I almost got into an infinite loop here. Aborting.
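As an illustration, a manually unrolled version of the kind of loop described might look like this. This is only a sketch: `myarray` and `ages` are stand-ins for the poster's actual data, and whether it helps at all must be confirmed by profiling.

```cpp
#include <algorithm>
#include <array>

// Sketch of 4-way unrolling: four independent accumulators let the CPU
// overlap the additions instead of serializing on a single running sum.
double sum_unrolled(const std::array<double, 100>& myarray,
                    const int* ages, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {          // one loop cycle = 4 iterations
        s0 += myarray[std::min(ages[i],     99)];
        s1 += myarray[std::min(ages[i + 1], 99)];
        s2 += myarray[std::min(ages[i + 2], 99)];
        s3 += myarray[std::min(ages[i + 3], 99)];
    }
    for (; i < n; ++i)                    // handle the leftover iterations
        s0 += myarray[std::min(ages[i], 99)];
    return (s0 + s1) + (s2 + s3);
}
```

Note that with floating point the summation order changes, so results may differ in the last bits compared to a strictly sequential sum.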

  • I did profile. The function `tidx()` was on the very first line using `gprof`. Also, because of the nature of the underlying simulation model, which I understand, I would expect this to be the bottleneck. Also, by making other changes previously to the function `tidx()`, I obtained significant speedup. It's a substantial part. – synaptik May 24 '13 at 18:05
  • @synaptik: What instruction is a bottleneck? Function call? Memory fetch? Other? –  May 24 '13 at 18:07
  • I'm not saying it's "THE" most crucial consumer of time, but that it seems to be significant. Hold on, I'll post profiling details in a second... – synaptik May 24 '13 at 18:08
  • @synaptik Remember that this function may be at the top simply because it's called very often. – Some programmer dude May 24 '13 at 18:09
  • ...correction, not "very first" line. But one of the first few lines. – synaptik May 24 '13 at 18:13
  • @synaptik: if you are using Linux, I recommend you use [perf](https://perf.wiki.kernel.org/index.php/Main_Page) to profile. –  May 24 '13 at 18:13
  • @synaptik: The profiling details you got... don't say anything at all :( Try `perf record` and `perf stat`. Also, you seem to be running a non-optimized build, because I doubt the compiler would fail to inline something as simple as `std::min`. – May 24 '13 at 18:17
  • You'd really need to look at the assembly for your code to determine if there is anything you can do to make it more efficient. I'd say in the case of `std::min` you're not likely to improve it by much or at all. At that point you need to consider that it's as fast as it is going to get unless you change your approach to the problem to avoid calling it so often. The best optimizations tend to be changes in overall algorithm and not shaving a cycle here and there. – Retired Ninja May 24 '13 at 18:47
  • I don't understand how the `std::min` function can be a bottleneck. How many assembly instructions? I don't think it should be more than 6 (depending on processor). The non-optimized function should be an "if" statement that returns the first or second argument. Since one of the arguments is a constant, I suggest you replace all occurrences with an "if" statement. Redesign your code. – Thomas Matthews May 24 '13 at 19:31
  • @ThomasMatthews: This is because it is not a bottleneck. But it may seem like one even in a profiler report. For instance, it may stall on memory access and take a very long time. Also, the code OP was running is not optimized, so... –  May 24 '13 at 19:34
  • @MooingDuck: It takes two parameters by reference, those turn into pointers, and dereferenced to read the value. Of course, this all happens if compiler is not optimizing. –  May 24 '13 at 19:48
  • @VladLazarenko: oh, so it does. I'd assumed `std::min` took by value, but I was mistaken. – Mooing Duck May 24 '13 at 19:49
3

The first mistake lots of people make is to eyeball the code, home in on something or other, and wonder if they can make it faster.

The second mistake is to run gprof on it, hoping to find a "bottleneck".

The only thing gprof can reliably find is whether your code is CPU-bound, and whether it is CPU-bound in code that you compiled. It is not good at finding function calls that could be done away with entirely. It is not good at finding I/O that you didn't know you were doing.

Plenty of people instead sample the program's call stack at random while it runs (the debugger-interrupt method described in the comments below), and here's why that works: whatever is responsible for the time shows up in the samples in proportion to the time it takes.

Mike Dunlavey
  • Thank you. This is *very* helpful. Until now, I thought I at least had an inkling of a notion of how to profile code. But clearly I was wrong. I'm going to try `perf` soon, and read more about what you posted. – synaptik May 25 '13 at 00:12
  • @synaptik: According to the `perf` stuff you posted, it looks like 25% of PC samples were in `strtod` converting strings to doubles. Are you doing a lot of I/O? Another 11% are in `SamplePSA`, whatever that is. It doesn't look like you're getting any inclusive times, or any I/O wait times. Please try the method I linked. – Mike Dunlavey May 25 '13 at 14:52
  • Yes, `SamplePSA` randomly samples from data on prostate-specific antigen (PSA) values collected for male patients. I expect this to be the most time-consuming aspect of the simulation. However, I would *not* expect `strtod` to be called much at all. I will investigate. I will try what you linked, as soon as I can. – synaptik May 25 '13 at 16:50
  • Update: when I ran `perf`, I ran it on a very small instance of the program. (Small number of simulations.) When I run on larger instances, `strtod` is no longer a bottleneck. Thus, `strtod` is not being called too frequently for my purposes. – synaptik May 25 '13 at 17:43
  • @synaptik: run it under GDB or whatever debugger you have. Give it enough work to run for enough time. While it's running, go to the output window and hit Ctrl-C. Go back to the GDB window and do `thread 1` to put it in your working thread. Then do `bt` to see what it's doing, in detail. Do that 5 or 10 times. Whatever is taking the time will show up in proportion to the time it takes. If there's something worth fixing, that will show it to you. Most of the time it's some function call you can find a way to avoid doing. – Mike Dunlavey May 25 '13 at 18:06
2

Good advice already given about profiling.

`min<T>(T x, T y)` should be the equivalent of `return (x < y) ? x : y;`

As assembler, that would become:

mov    x, %eax
cmp    y, %eax
cmovg  y, %eax

or something along those lines. These three instructions will almost certainly be emitted inline if you enable `-O2` in gcc.
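For comparison, here is a stand-alone C++ version of that ternary (behavior identical to the original `tidx`; whether gcc actually emits the `cmov` sequence above can be checked with `g++ -O2 -S`):

```cpp
// Branchless clamp, equivalent to std::min(a, 99).
// With -O2, gcc typically compiles this to cmp + cmovg: no call, no branch.
inline int tidx(int a) {
    return (a < 99) ? a : 99;
}
```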

Mats Petersson
  • The OP's issue must be somewhere else. I can't see how 3 to 6 assembly instructions can be a bottleneck, unless one considers a cache miss, but those operands are in registers and not in the cache. – Thomas Matthews May 24 '13 at 19:35
2

A couple of ideas to look into

  • check the generated assembly for the function. It may or may not compile to something suboptimal
  • call the function fewer times
  • rewrite the function to use SSE instructions to compute the minimum for 4 (or 8 with AVX) values at a time
  • restructure how the function is called. Ideally, the calls shouldn't be so far apart that the code gets evicted from the instruction cache.
  • don't pass the int by const reference. What's the point in that? Just pass a copy of the int.
  • examine whether any aliasing or false sharing might slow you down at the call site.

But really, it is an extremely simple function. There's not much to be gained just by looking at its implementation. Most of my suggestions involve how the function is called, where it is called from, when it is called, and what data it is called on. Those are likely what you need to look at.
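A sketch of the batched-minimum idea (function and variable names are illustrative; with `-O3` on a CPU that supports SSE4.1, gcc can usually vectorize a loop like this into the packed-minimum instruction `pminsd`, clamping four ints per instruction):

```cpp
#include <algorithm>

// Clamp a whole batch of ages in one pass; a simple loop like this is
// a good candidate for auto-vectorization into SSE packed-min operations.
void clamp_ages(const int* age, int* idx, int n) {
    for (int i = 0; i < n; ++i)
        idx[i] = std::min(age[i], 99);
}
```

Precomputing the clamped indices in a batch like this also separates the cheap clamp from whatever else the loop body does, which can make profiling results easier to interpret.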

jalf
  • Please re-read this: "There is a loop in my code that iterates 100 million times (required for 100 million replications of a simulation model)". 100 million seconds is over 3 years. Get a calculator and do the mathematics. Summat is not true. – Ed Heal May 25 '13 at 04:36
  • @EdHeal I'm... not sure what you're trying to say. – jalf May 25 '13 at 10:41
  • @jalf - Doing a simulation once takes x amount of time. Say x is one second. Then repeat for 100 million x's. That is a lot of time. Besides, a simulation is run to get data, so 100 million pieces of data is a lot to process. So the question is a joke, and there are better ways to improve performance than worrying about a small function. – Ed Heal May 25 '13 at 12:09
  • So... why are you telling *me* that? – jalf May 25 '13 at 12:44
  • @jaif - My mistake should have been synaptik. Sorry. Oops! – Ed Heal May 25 '13 at 13:03
2

Remember to profile an executable with optimization turned on. For performance testing, running a non-optimized executable is useless, as it will have completely different performance characteristics.

Then consider what you can do just to avoid so many lookups. Doing less work (algorithmic improvements) will take less time.

Also, write code without 'premature pessimization', as Herb Sutter likes to call it.

Definition: Premature pessimization is when you write code that is slower than it needs to be, usually by asking for unnecessary extra work, when equivalently complex code would be faster and should just naturally flow out of your fingers.

For example, inlining may or may not help, but you may want to write the code so as not to preclude inlining. Later you can force or prevent inlining and compare which is better for your execution environment. You also probably don't want to use references to small types like ints. Absent optimization, a reference parameter will probably be implemented with a pointer, which today is usually larger and slower than an int. Even on 32-bit hardware, a reference won't be faster than an int.

int tidx(int a) {
    return std::min(a,99);
}

Then you can try some other optimization techniques: do independent tasks run in parallel? Do your data structures have good locality-of-reference characteristics? If things do run in parallel, are you being affected by false sharing? Can you use SIMD or some other form of data parallelization? You can also play with compiler settings or enable specific optimizations in particular parts of the code. This is where performance testing becomes really important, because different hardware can have radically different behavior. Also, since most of these kinds of optimizations obfuscate code, you don't want to pay that price for little or nothing in return.
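Since the posted build line already passes `-fopenmp`, one concrete direction for the parallelism question is a sketch like this. It is purely illustrative: it assumes each replication is independent, the loop body is a stand-in kernel, and a real version would need per-thread RNG state.

```cpp
// Hypothetical sketch: independent replications split across threads
// via OpenMP. If OpenMP is disabled, the pragma is ignored and the
// loop runs serially with the same result.
double run_replications(long n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i)
        total += static_cast<double>(i % 100);  // stand-in for one replication
    return total;
}
```

The `reduction(+:total)` clause gives each thread a private accumulator and combines them at the end, avoiding false sharing on the shared sum.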

bames53
  • I actually think the opposite of "For performance testing running a non-optimized executable is useless". Turning on the -O flags is the *last* step you should take, not the first, because speed bugs consist of doing things you don't need to do, and finding them is a debugging job, not a wishing-it-were-fast job. When you've cleaned out all the dumb stuff, then turn on the optimizer to squeeze out the last few cycles. – Mike Dunlavey May 24 '13 at 19:23
  • @MikeDunlavey Sure, for algorithmic work, avoiding premature pessimization, etc., optimization is not needed, but then you're also not doing performance testing. When you start on work that requires performance testing, any work you do on the non-optimized version can end up having no effect, or worse, having a _negative_ effect on the optimized version. You want to avoid such wasted or counterproductive effort. – bames53 May 24 '13 at 19:28
  • [*This is the kind of situation I'm talking about.*](http://scicomp.stackexchange.com/a/1870/1262) You start with what seems to be a perfectly normally-built not-little program. Something in it is taking 1/3 of the time, and really could be done another way. So you fix it. Then something else turns up. Then something else. Then you get to something that didn't cost much, percentage-wise, at the beginning, but now it is a big part, because you've shaved away a mass of other stuff. Before you know it, you've shaved away over 99% of the time. Then you do -O3. – Mike Dunlavey May 24 '13 at 20:27
  • People tend to think "that couldn't possibly be true of *my* code", so they turn on optimization and think they've done a good job. Maybe they run a profiler, stare at pretty output, and figure the code's blessed. Then if you say gee the whole thing's a pig, what do you get? the "mechanic's shrug" :-) – Mike Dunlavey May 24 '13 at 20:33
  • I don't see that that example is showing the value of hand optimizing code based on debug build performance. You say "Before you know it, you've shaved away over 99% of the time. Then you do -O3." But shaving away 99% of the -O0 time doesn't mean you've shaved away 99% of the -O3 time. Your link says to me not so much that tweaking non-optimized code performance is best, but that samples should be aggregated a particular way in order to usefully point out where to focus. – bames53 May 24 '13 at 20:56
  • Perhaps saying the debug version is 'useless' is too strong. If you shave off 99% of the -O0 time you've probably sped up the -O3 time as well. – bames53 May 24 '13 at 21:03
  • The problem with the optimizer is it has to assume you truly need everything you wrote, while the problem with programmers is they think they truly need everything they wrote. It's a big case of *self deception*, analogous to people thinking they don't write bugs. The way to get massive speedup is to find out, among the stuff you wrote, what you could actually do without. The optimizer can give you some percent speedup if much time is spent in call-free code you compile, but it can't tell you what you could have gotten rid of. – Mike Dunlavey May 25 '13 at 16:05