Use this tag to ask questions about Intel® VTune™ Profiler, an advanced performance profiler for finding and optimizing performance bottlenecks across CPU, GPU, and FPGA systems.
Questions tagged [intel-vtune]
182 questions
65 votes, 2 answers
Performance difference between Windows and Linux using Intel compiler: looking at the assembly
I am running a program on both Windows and Linux (x86-64). It has been compiled with the same compiler (Intel Parallel Studio XE 2017) with the same options, and the Windows version is 3 times faster than the Linux one. The culprit is a call to…

InsideLoop
18 votes, 1 answer
Pthread Mutex: pthread_mutex_unlock() consumes lots of time
I wrote a multithreaded program with pthreads, using the producer-consumer model.
When I profiled my program with Intel VTune Profiler, I found that the producer and consumer spend a lot of time in pthread_mutex_unlock. I don't understand why this…

lei_z
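The question above describes a producer-consumer program built on pthread mutexes. As a point of reference only, here is a minimal C sketch of that pattern, assuming one producer, one consumer, and a fixed-size ring buffer guarded by one mutex and two condition variables. `struct queue`, `queue_push`, `queue_pop`, `run_demo`, and `QUEUE_CAP` are hypothetical names, not code from the question. The mutex is held only while the buffer indices change, and threads wait on condition variables instead of repeatedly locking and unlocking.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 64 /* hypothetical capacity */

/* Hypothetical bounded FIFO shared by one producer and one consumer. */
struct queue {
    int items[QUEUE_CAP];
    size_t head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
};

static void queue_init(struct queue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

static void queue_destroy(struct queue *q)
{
    pthread_cond_destroy(&q->not_empty);
    pthread_cond_destroy(&q->not_full);
    pthread_mutex_destroy(&q->lock);
}

static void queue_push(struct queue *q, int v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP) /* block, don't spin, while full */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = v;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty); /* wake a waiting consumer, if any */
    pthread_mutex_unlock(&q->lock);     /* lock held only for the index update */
}

static int queue_pop(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0) /* block, don't spin, while empty */
        pthread_cond_wait(&q->not_empty, &q->lock);
    int v = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full); /* wake a waiting producer, if any */
    pthread_mutex_unlock(&q->lock);
    return v;
}

/* Demo: a producer thread pushes 1..n; the calling thread pops and sums. */
static struct queue demo_q;
static int demo_n;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 1; i <= demo_n; ++i)
        queue_push(&demo_q, i);
    return NULL;
}

static long run_demo(int n)
{
    pthread_t t;
    long sum = 0;
    demo_n = n;
    queue_init(&demo_q);
    if (pthread_create(&t, NULL, producer, NULL) != 0) {
        queue_destroy(&demo_q);
        return -1; /* thread could not be started */
    }
    for (int i = 0; i < n; ++i)
        sum += queue_pop(&demo_q);
    pthread_join(t, NULL);
    queue_destroy(&demo_q);
    return sum;
}
```

`run_demo(1000)` pushes 1 through 1000 through the queue and returns their sum, 500500. The `while` loops around `pthread_cond_wait` re-check the condition after every wakeup, which also covers spurious wakeups.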
15 votes, 6 answers
Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?
Continuing on from my first question, I am trying to optimize a memory hotspot found via VTune profiling a 64-bit C program.
In particular, I'd like to find the fastest way to test if a 128-byte block of memory contains all zeros. You may assume any…

eyepopslikeamosquito
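For the 128-byte zero test above, one portable baseline is to OR the block together as sixteen 64-bit words and compare the result with zero once at the end. This is a sketch, not one of the thread's answers: `block_is_zero128` is a hypothetical name, the code assumes the pointer refers to at least 128 readable bytes, and it loads through `memcpy` so no particular alignment is required. Whether the compiler turns the loop into SIMD loads depends on the compiler and flags.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns true if all 128 bytes at p are zero.
   Assumes p points to at least 128 readable bytes. */
static bool block_is_zero128(const void *p)
{
    const unsigned char *bytes = p;
    uint64_t acc = 0;
    for (int i = 0; i < 128; i += 8) {
        uint64_t word;
        memcpy(&word, bytes + i, sizeof word); /* alignment-safe 8-byte load */
        acc |= word;                           /* any set bit survives the OR */
    }
    return acc == 0; /* single comparison after the whole block is read */
}
```

Testing one accumulator at the end keeps the loop free of data-dependent branches; the trade-off is that a block with an early nonzero byte is not rejected early.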
14 votes, 4 answers
How to profile time spent in memory access in C/C++ applications?
Total time spent by a function in an application can be broadly divided into two components:
- Time spent on actual computation (Tcomp)
- Time spent on memory accesses (Tmem)
Typically, profilers provide an estimate of the total time spent by a…

Imran
10 votes, 2 answers
What might cause the same SSE code to run a few times slower in the same function?
Edit 3: The images are links to the full-size versions. Sorry for the pictures-of-text, but the graphs would be hard to copy/paste into a text table.
I have the following VTune profile for a program compiled with icc --std=c++14 -qopenmp -axS -O3…

iksemyonov
8 votes, 4 answers
Profiling help required
I have a profiling issue - imagine I have the following code...
void main()
{
    well_written_function();
    badly_written_function();
}
void well_written_function()
{
    for (a small number)
    {
        highly_optimised_subroutine();
…

Mick
8 votes, 3 answers
Is VTune Worth Considering for Delphi?
Running through all the questions on profiling tools, I was surprised to discover Intel's VTune, which I hadn't heard of before. At $700, it is even more expensive than AQTime.
But before I make the decision to put down the big bucks for AQTime, has…

lkessler
7 votes, 2 answers
Optimizing SSE code
I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics and it executes…

Yrlec
7 votes, 1 answer
VTune Profiler gives error: "The Data Cannot be displayed, there is no viewpoint available for data"
I want to optimize my C++ code on a Linux platform. For that I am using the Intel VTune Performance Analyzer profiler. When I run a Hotspots analysis, it successfully runs the binary executable whose path I have specified and then it…

Jasdeep Singh Arora
6 votes, 2 answers
MKL Performance on Intel Phi
I have a routine that performs a few MKL calls on small matrices (50-100 x 1000 elements) to fit a model, which I then call for different models. In pseudo-code:
double doModelFit(int model, ...) {
    ...
    while( !done ) {
        cblas_dgemm(...);
…

Andrew
6 votes, 1 answer
Hotspot in a for loop
I am trying to optimize this code.
static
lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 )
{
    const size_t len1 = s1.size(), len2 = s2.size();
    std::vector col( len2+1 ), prevCol( len2+1 );
…

qdii
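The excerpt above computes Levenshtein distance with two columns, `col` and `prevCol`. A minimal C sketch of that two-column dynamic-programming scheme follows, for reference. It is not the question's code (the rest of which is truncated here): `levenshtein` is a hypothetical name, it works on NUL-terminated byte strings rather than `std::string`, and it returns `(size_t)-1` if allocation fails.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: Levenshtein distance with two rolling columns,
   so memory is O(len2) instead of O(len1 * len2). */
static size_t levenshtein(const char *s1, const char *s2)
{
    size_t len1 = strlen(s1), len2 = strlen(s2);
    size_t *col = malloc((len2 + 1) * sizeof *col);
    size_t *prev = malloc((len2 + 1) * sizeof *prev);
    if (!col || !prev) {
        free(col);
        free(prev);
        return (size_t)-1; /* allocation failure */
    }

    for (size_t j = 0; j <= len2; ++j)
        prev[j] = j; /* distance from "" to the first j chars of s2 */

    for (size_t i = 0; i < len1; ++i) {
        col[0] = i + 1; /* distance from the first i+1 chars of s1 to "" */
        for (size_t j = 0; j < len2; ++j) {
            size_t del = prev[j + 1] + 1;
            size_t ins = col[j] + 1;
            size_t sub = prev[j] + (s1[i] != s2[j]);
            size_t best = del < ins ? del : ins;
            col[j + 1] = best < sub ? best : sub;
        }
        size_t *tmp = col; /* swap column pointers instead of copying */
        col = prev;
        prev = tmp;
    }

    size_t result = prev[len2]; /* last computed column after the final swap */
    free(col);
    free(prev);
    return result;
}
```

`levenshtein("kitten", "sitting")` returns 3. Swapping the two column pointers after each row avoids copying a whole column on every iteration of the outer loop.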
5 votes, 3 answers
VTune reports "[Outside any known module]"
I am using Intel(R) VTune(TM) Amplifier XE 2013 Update 5 (build 274450) to collect hotspots for my Linux application, but the report says that "[Outside any known module]" consumes most of the time, so I want to get more information about the unknown module.
When…

Caukie Relsis
4 votes, 1 answer
When profiling, most of the time is spent in nvoglv64.dll. What should I deduce?
I am profiling a C++ application with Intel VTune Amplifier. Most of the time seems to be spent in nvoglv64.dll, more precisely in DrvPresentBuffers and/or KeSynchronizeExecution. Note that I have an NVIDIA GeForce graphics card.
I am new to the…

Palmira
4 votes, 1 answer
system_call_after_swapgs, where is my code spending most of the time?
I am trying to profile my code with Intel VTune. When looking at the function call stack, it looks like most of the time is spent in a function called system_call_after_swapgs. However, there is no stack information. My question is:
what is…

Manfredo
4 votes, 1 answer
What is _kmp_fork_barrier and how to see if there is load imbalance?
I'm using Intel VTune Amplifier to see how my parallel application scales.
Note that I don't use any explicit locking mechanism.
It scales pretty well on my 4-core laptop (considering that there are portions of the algorithm that can't be…

cplusplusuberalles