Profiling code built from ifort 11.1 yields __powr8i4 routine, what is it?

Question

I built a Fortran code with Intel ifort 11.1, using the -p option to produce profiling data. When I check the results, some routines appear that aren't part of my code. I assume they were put there by the Intel compiler. They include:

__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J

There are others, too. When I build the code without optimization, the code doesn't spend much time in them, except for __powr8i4, which the results show taking 3.3% of the time. However, when I build the code with optimization, this number goes way up to about 35%. I can't seem to find out what these routines are, and they are confusing my results because I want to know where to look to optimize my code.

I still have never found a specific answer to my question, but what I **think** "powr8i4" is, is a function in the Intel math library that calculates the power of something. In other words, if you specify x**2, that operation will be done by "powr8i4". By stopping my code during debugging, the stack trace points to this library when the code is on a power function (e.g., x**2), which is what makes me think this is the case. In other words, it seems my code is spending a lot of time doing power operations. — rks171, Dec 03 '12 at 16:29
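To make the comment above concrete, here is a minimal sketch of how a "real base, integer exponent" power routine could work, using the generic repeated-squaring technique. This is an assumption for illustration only: it is not Intel's implementation, the name `pow_int` is hypothetical, and Python is used just to keep the example short. It also shows why replacing x**2 with x*x gives the same value without a library call.

```python
def pow_int(base: float, exponent: int) -> float:
    """Raise a float to an integer power by repeated squaring.

    Generic textbook technique; NOT taken from the Intel runtime.
    """
    if exponent < 0:
        # Negative powers are the reciprocal of the positive power.
        return 1.0 / pow_int(base, -exponent)
    result = 1.0
    while exponent > 0:
        if exponent & 1:        # lowest bit set: fold the current factor in
            result *= base
        base *= base            # square the factor for the next bit
        exponent >>= 1
    return result


x = 3.0
# A square via the helper and via plain multiplication agree.
print(pow_int(x, 2), x * x)
```

If the profile really is dominated by small fixed powers such as x**2, writing them as explicit multiplications is one thing to try, but measure before and after.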

score 0 · Answer 1 · edited May 23 '17 at 12:04

Most programs spend a large share of their cycles inside called subroutines, often library subroutines, so if you look only at exclusive (self) time, those library routines dominate the listing, which is what you are seeing.

So point 1 is look at inclusive (self plus callees) time.
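The difference between self and inclusive time can be sketched with a toy call tree. Everything here is made up for illustration: the routine names, the timings, and the assumption that each routine is called from exactly one place.

```python
# Hypothetical call tree: name -> (self_seconds, [callees]).
# Each routine is assumed to have a single caller, so this is a tree.
CALL_TREE = {
    "main": (0.5, ["solve", "write_output"]),
    "solve": (1.0, ["pow_helper"]),
    "pow_helper": (2.0, []),
    "write_output": (0.5, []),
}


def inclusive_time(name, tree=CALL_TREE):
    """Self time of `name` plus the inclusive time of everything it calls."""
    self_time, callees = tree[name]
    return self_time + sum(inclusive_time(c, tree) for c in callees)


total = inclusive_time("main")
for name, (self_time, _) in CALL_TREE.items():
    incl = inclusive_time(name)
    print(f"{name:13s} self {100 * self_time / total:5.1f}%  "
          f"inclusive {100 * incl / total:5.1f}%")
```

In this toy data the helper has the largest self time, but the inclusive figure for `solve` is what points at the code you can actually change: the call site.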

Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.

So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but in some subroutine several layers down, guess what - it's doing I/O.)
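A small sketch of why the two clocks disagree. Here `time.sleep` stands in for blocking I/O (an assumption made so the example runs anywhere), and `measure` and `fake_io` are hypothetical helper names.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process only
    func()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def fake_io():
    # Stand-in for a blocking read or write: the process waits
    # without using the CPU.
    time.sleep(0.2)


wall, cpu = measure(fake_io)
print(f"wall clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

A profiler that only counts CPU time would see almost nothing in `fake_io`, even though the program spent its whole run waiting there.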

Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time. However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.

Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible. (Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.) One example of such a profiler is Zoom.

It may be that after you do all this, you don't see much you could do to speed up the code. However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.

What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample. You can find things that way that you can't find any other way. (Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.) And that is point 4.
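As a rough sketch of what you can do with such samples: treat each pause as one call stack, and count the fraction of samples in which a given line appears. That fraction estimates the line's inclusive time percentage. The stacks below are invented, and the "routine:line" labels are hypothetical.

```python
from collections import Counter

# Hypothetical stack samples, outermost frame first, as "routine:line".
SAMPLES = [
    ["main:10", "solve:42", "pow_helper:7"],
    ["main:10", "solve:42", "pow_helper:7"],
    ["main:10", "solve:42", "pow_helper:7"],
    ["main:10", "solve:57"],
    ["main:12", "write_output:3"],
]


def inclusive_percent(samples):
    """Percentage of samples in which each line is on the stack."""
    counts = Counter()
    for stack in samples:
        # set() counts a line once per sample, even under recursion.
        for frame in set(stack):
            counts[frame] += 1
    return {frame: 100.0 * n / len(samples) for frame, n in counts.items()}


percent = inclusive_percent(SAMPLES)
for frame, pct in sorted(percent.items(), key=lambda kv: -kv[1]):
    print(f"{frame:16s} {pct:5.1f}%")
```

With only a handful of samples the percentages are coarse, but a line that shows up on most of them is worth reading closely, which is the point of the method.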
