
I sometimes see the terms software benchmarking and profiling used interchangeably, but as far as my understanding goes, there is a subtle difference.

Both are connected by time. But whereas benchmarking is mainly about determining a certain speed score that can be compared with other applications, profiling gives you exact information about where your application spends most of its time (or most of its CPU cycles).

For me it was always like this: integration testing is the counterpart to benchmarking, and unit testing is the counterpart to profiling. But how does micro-benchmarking fit into this?

Someone stated here:

Profiling and benchmarking are flip sides of the same coin, profiling helps you to narrow down to where optimization would be most useful, benchmarking allows you to easily isolate optimizations and cross-compare them.

Another one said here about profiling:

Profiling means different things at different times. Sometimes it means measuring performance. Sometimes it means diagnosing memory leaks. Sometimes it means getting visibility into multi-threading or other low-level activities.

So, are those techniques conceptually different, or is it just not that black and white?

Jim McAdams

4 Answers


A benchmark measures the time for some whole operation, e.g. I/O operations per second under some workload. So the result is typically a single number, in either seconds or operations per second, or a data set with results for different parameters so you can graph it.

You might use a benchmark to compare the same software on different hardware, or different versions of some other software that your benchmark interacts with, e.g. benchmarking max connections per second with different Apache settings.
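
For instance, here is a minimal benchmark sketch in Python (the gzip workload and sizes are made up purely for illustration): time one whole operation end to end and report a single number.

import gzip
import time

# Hypothetical workload: compress one in-memory blob and report a single number.
data = b"some repetitive payload " * 1_000_000

start = time.perf_counter()
compressed = gzip.compress(data)
elapsed = time.perf_counter() - start

print(f"compressed {len(data)} bytes to {len(compressed)} in {elapsed:.3f} s")
print(f"throughput: {len(data) / elapsed / 1e6:.1f} MB/s")

Run it on two machines, or with two compression levels, and you have something to compare.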


Profiling is not aimed at comparing different things: it's about understanding the behaviour of a program. A profile result might be a table of time taken per function, or even per instruction with a sampling profiler. You can tell it's a profile not a benchmark because it makes no sense to say "that function took the least time so we'll keep that one and stop using the rest".

Read the Wikipedia article to learn more about it: https://en.wikipedia.org/wiki/Profiling_(computer_programming)

You use a profile to figure out where to optimize. A 10% speedup in a function where your program spends 99% of its time is more valuable than a 100% speedup in any other function. Even better is when you can improve your high-level design so the expensive function is called less, as well as just making it faster.
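
As a minimal sketch of that in Python (the functions below are hypothetical stand-ins), the built-in cProfile module can rank functions by how much time the program spent in them, which is exactly what tells you where optimizing pays off:

import cProfile
import pstats

def cheap_step():
    return sum(range(100))

def expensive_step():
    # Hypothetical hot spot: most of the total time is spent here.
    return sum(i * i for i in range(200_000))

def workload():
    for _ in range(50):
        cheap_step()
        expensive_step()

profiler = cProfile.Profile()
profiler.runcall(workload)
# Sort by cumulative time so the function worth optimizing floats to the top.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)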


Microbenchmarking is a specific form of benchmarking. It means you're testing one super-specific thing to measure just that in isolation, not the overall performance of anything that's really useful.

Example microbenchmark results would be the cost of one isolated operation: a single hash-table lookup, one call to a small function, or one pass of a tight loop, each timed by repeating just that operation many times.

Example non-micro benchmark results:

  • compressing this 100MB collection of files took 23 seconds with 7-zip (with specific options and hardware).
  • compiling a Linux kernel took 99 seconds on some hardware / software combination.

See also https://en.wikipedia.org/wiki/Benchmark_(computing)#Types_of_benchmarks.

Micro-benchmarking is a special case of benchmarking. If you do it right, it tells you which operations are expensive and which are cheap, which helps while you're trying to optimize. If you do it wrong, you probably didn't even measure what you set out to measure at all. For example, you wrote some C to test for loops vs. while loops, but the compiler generated different code for unrelated reasons, and your results are meaningless. (Different ways to express the same logic almost never matter with modern optimizing compilers; don't waste time on this.) Micro-benchmarking is hard.

The other way to tell it's a micro-benchmark is that you usually need to look at the compiler's asm output to make sure it's testing what you wanted it to test (e.g. that it didn't optimize across iterations of your repeat-10M-times loop by hoisting something expensive out of it; that loop is supposed to repeat the whole operation enough times to give a duration that can be measured accurately).

Micro-benchmarks can distort things, because they test your function with caches hot and branch predictors primed, and they don't run any other code between invocations of the code under test. This can make huge loop unrolling look good, when as part of a real program it would lead to more cache misses. Similarly, it makes big lookup tables look good, because the whole lookup table ends up in cache. The full program usually dirties enough cache between calls to the function that the lookup table doesn't always hit in cache, so it would have been cheaper just to compute something. (Most programs are memory-bound; re-computing something not too complex is often as fast as looking it up.)
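
As a concrete sketch of what a micro-benchmark looks like (in Python with timeit; the two list-building variants compared here are purely illustrative), note that it times one tiny operation over and over with the data already hot in cache:

import timeit

# Two ways to build the same small list, measured in isolation.
setup = "data = list(range(1000))"
loop_version = "out = []\nfor x in data:\n    out.append(x * 2)"
comprehension_version = "out = [x * 2 for x in data]"

# repeat() reruns each snippet several times; taking the minimum reduces
# noise from whatever else the machine is doing.
for name, stmt in [("for loop", loop_version), ("list comprehension", comprehension_version)]:
    best = min(timeit.repeat(stmt, setup=setup, number=10_000, repeat=5))
    print(f"{name}: {best:.3f} s for 10,000 runs")

The result is a pair of numbers for one isolated operation, not a statement about any real program's performance.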

Peter Cordes
  • Thank you for taking the time. But my confusion starts with micro-benchmarking: would that mean I just _benchmark_ only a certain function of my program? And what exactly is then the difference from _profiling_ a certain function? – Jim McAdams Sep 08 '16 at 09:28
  • @JimMcAdams: Yes, that's exactly the sort of thing microbenchmarking is all about: repeating the same work many times. In a profile result, drilling down to a single function would hopefully show you % of total time on a line-by-line basis. (Or instruction by instruction, since at this granularity it matters more what the asm looks like than the source.) Or you could profile recording cache misses instead of clock cycles. – Peter Cordes Sep 08 '16 at 10:43
  • Nice answer, (I'm trying to learn how to explain things clearly enough) have you practiced teaching? :). – 0xc0de Sep 15 '17 at 07:18
  • 1
    @0xc0de. Yeah, I like to share my knowledge, whether it's in my many (long and detailed :) SO answers or while playing Ultimate [frisbee] or in something else. – Peter Cordes Sep 15 '17 at 07:20

A benchmark can help you observe the system's behavior under load, determine the system's capacity, learn which changes are important, or see how your application performs with different data.

Profiling is the primary means of measuring and analyzing where time is consumed. Profiling entails two steps: measuring tasks and the time elapsed, and aggregating and sorting the results so that the important tasks bubble to the top. -- High Performance MySQL

What I understand is: benchmarking is measuring in order to know your application, while profiling is measuring in order to improve your application.

lfree
  • Thank you for taking the time. So micro-benchmarking would mean you'd get to know your application just a little bit? But what's the benefit of that? – Jim McAdams Sep 08 '16 at 09:25
  • I think micro-benchmarking would mean you'd get to know your application in one specific aspect or at a critical point, instead of just a little bit. – lfree Sep 09 '16 at 01:52

Often people do profiling not to measure how fast a program is, but to find out how to make it faster.

Often they do this on the assumption that slowness is best found by measuring the time spent by particular functions or lines of code.

There is a clear way to think about this: If a function or line of code shows an inclusive percent of time, that is the fraction of time that would be saved if the function or line of code could be made to take zero time (by not executing it or passing it off to an infinitely fast processor).

There are other things besides functions or lines of code that can take time. Functions and lines of code are descriptions of what the program is doing, but they are not the only possible descriptions.

Suppose you run a profiler that, every N seconds of actual time (not just CPU time), collects a sample of the program's state, including the call stack and data variables. The call stack is more than a stack of function names: it is a stack of call sites where those functions are called, and often the argument values. Then suppose you could examine and describe each of those samples.

For example, descriptions of a sample could be:

  • Routine X is in the process of allocating memory for the purpose of initializing a dictionary used in recording patients by routine Q when such a thing becomes necessary.

  • The program is in the process of reading a dll file for the purpose of extracting a string resource that, several levels up the call stack, will be used to populate a text field in a progress bar that exists to tell the user why the program is taking so long :)

  • The program is calling function F with certain arguments, and it has called it previously with the same arguments, giving the same result. This suggests one could just remember the prior result (see the sketch after this list).

  • The program is calling function G which is in the process of calling function H just to decipher G's argument option flags. The programmer knows those flags are always the same, suggesting a special version of G would save that time.

  • etc. etc.
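
For the repeated-call-with-the-same-arguments case above, here is a minimal sketch of "just remember the prior result" in Python (the function f below is a made-up stand-in for whatever expensive call the samples keep catching):

from functools import lru_cache

@lru_cache(maxsize=None)
def f(n):
    # Stand-in for an expensive pure function that keeps showing up in samples.
    return sum(i * i for i in range(n))

f(100_000)             # computed once
f(100_000)             # same argument again: answered from the cache
print(f.cache_info())  # shows hits=1, misses=1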

Descriptions like those above are all possible. If a description accounts for F percent of the time, then F percent is the probability that any given sample will meet that description. Simpler descriptions are:

  • Routine or line of code X appears on Q percent of stack samples. That is measured inclusive percent.
  • Routine D appears immediately above routine E on R percent of stack samples. That number could be put on the arc of a call graph from D to E.
  • Stack sequence main->A->B->C->D->E is the sequence that appears on the largest number of samples. That is the "hot path".
  • The routine that appears most often at the bottom of the stacks is T. That is the "hot spot".

Most profiler tools only give you these simple descriptions. Some programmers understand the value of examining the samples themselves, so they can make more semantic descriptions of why the program is spending its time. If the objective were to accurately measure the percentage of time due to a particular description, then one would have to examine a large number of samples. But if a description appears on a large fraction of a small number of samples, one has not measured it accurately, but one knows it is large, and it has been found accurately. See the difference? You can trade off accuracy of measurement for power of speedup finding.
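
A small sketch of that trade-off, assuming samples are taken at independent random times: if some activity is responsible for a fraction p of the total time, the chance that it shows up on at least one of n samples is 1 - (1 - p)^n, so even a handful of samples is almost certain to catch anything big, even though p itself is only measured roughly.

# Chance that an activity taking fraction p of the time appears on at
# least one of n random stack samples.
def chance_of_seeing(p, n):
    return 1 - (1 - p) ** n

for p in (0.5, 0.3, 0.1):
    for n in (5, 10, 20):
        print(f"p={p:.0%}, {n:2d} samples: {chance_of_seeing(p, n):.1%}")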

That's the principle behind random pausing, and the statistical justification is here.

Mike Dunlavey

An example of profiling:

import cProfile
import re
cProfile.run('re.compile("foo|bar")')

Output:

    197 function calls (192 primitive calls) in 0.002 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.001    0.001 <string>:1(<module>)
     1    0.000    0.000    0.001    0.001 re.py:212(compile)
     1    0.000    0.000    0.001    0.001 re.py:268(_compile)
     1    0.000    0.000    0.000    0.000 sre_compile.py:172(_compile_charset)
     1    0.000    0.000    0.000    0.000 sre_compile.py:201(_optimize_charset)
     4    0.000    0.000    0.000    0.000 sre_compile.py:25(_identityfunction)
   3/1    0.000    0.000    0.000    0.000 sre_compile.py:33(_compile)

https://docs.python.org/3.10/library/profile.html?highlight=profile#module-profile

The profiler modules are designed to provide an execution profile for a given program, not for benchmarking purposes; for that, there is timeit for reasonably accurate results.

This particularly applies to benchmarking Python code against C code: the profilers introduce overhead for Python code, but not for C-level functions (unfair!), and so the C code would seem faster than any Python one.

cProfile is recommended for most users; it's a C extension with reasonable overhead that makes it suitable for profiling long-running programs. It is based on lsprof.
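
As a rough counterpart to the cProfile example above, the same call can be benchmarked with timeit; it produces one overall number rather than a per-function breakdown (the repeat count of 10,000 is arbitrary):

import timeit

# One overall number for the whole operation, no per-function breakdown.
total = timeit.timeit('re.compile("foo|bar")', setup="import re", number=10_000)
print(f"{total:.4f} s for 10,000 calls")
# Note: re caches compiled patterns, so after the first call this mostly
# measures the cache lookup rather than a fresh compilation.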

Good Pen