
To do the eigendecomposition in C++, I use the routine "zhpev". This routine is embedded in a DLL file of a larger piece of software and is called heavily during run time. After ~5000 calls of "zhpev" I measure the run time. Everything is fine for the first 900 run time evaluations: the run time is around 0.7 seconds with little variation. However, after 900 run time evaluations, the run time suddenly increases from 0.7 seconds up to 2.7 seconds with heavy variation.

I made the following observations:

  • The run time issues occur independently of the input data of "zhpev".
  • The routine "zhpev" runs fine and stably in a small program. It seems that the interaction with other parts causes the trouble.
  • After replacing "zhpev" with another routine for eigendecomposition, everything runs smoothly with little variation in run time.
  • The run time issues occur with and without multi-threading.
  • I do not use dynamic memory allocation. All variables are allocated as static variables.
  • The problem is similar to Visual C++ function suddenly 170 ms slower (4x longer), but I could not detect any memory leaks in my code.

Sorry that I cannot post any code, since the project I'm working on is too large.

I would appreciate any hint that helps me stop this strange behaviour!

Edit: The routine "zhpev" works on complex Hermitian matrices of size 32x32 in double precision, so the chunks of data processed at a time are rather small.

Update: 1) Paging is not the problem here. I disabled the paging file in the system options, and the run time issue is still not fixed. 2) Running the application on a different Windows computer leads to the same run time issues; however, the run time increase now starts later, after 1400 run time evaluations.
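
For reference, one way to verify this kind of thing is to log the page-fault count and the working set of the process once per run time evaluation. The sketch below is only illustrative (the helper LogMemoryStats is not part of my project); it uses GetProcessMemoryInfo from psapi.h:

#include <windows.h>
#include <psapi.h>   // link against psapi.lib
#include <stdio.h>

// Print the cumulative page-fault count and the current working set.
void LogMemoryStats()
{
    PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
    {
        printf("Page faults: %lu, working set: %lu KB\n",
               pmc.PageFaultCount,
               (unsigned long)(pmc.WorkingSetSize / 1024));
    }
}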

Update: I found out that the run time problems occur only if I call "zhpev" inside a thread. With this I could create a small code example that runs into the same problems.

This is my code:

#include <windows.h>
#include <tchar.h>
#include <strsafe.h>
#include "stdafx.h"
#include "mkl_lapack.h"
#include "mkl_service.h"
#include <time.h>
#include <stdio.h>
#include <iostream>
#include <fstream>
#include <stdlib.h>

using namespace std;
#define CACHE_LINE  32
#define CACHE_ALIGN __declspec(align(CACHE_LINE))

#define MAX_THREADS 2
#define BUF_SIZE 255

DWORD WINAPI MyThreadFunction( LPVOID lpParam );
void ErrorHandler(LPTSTR lpszFunction);

// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
// This is the critical function.
void Eigendecomposition();
// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

typedef struct MyData {
    int val1;
    int val2;
} MYDATA, *PMYDATA;



int _tmain()
{
    PMYDATA pDataArray[MAX_THREADS];
    DWORD   dwThreadIdArray[MAX_THREADS];
    HANDLE  hThreadArray[MAX_THREADS]; 

    std::ofstream ofs;

    double tstart;
    double tend;

    double proc_time_pure;

    for(int j=0;j<10000;j++){

    // Start one iteration
    tstart = clock(); 



    // Create MAX_THREADS worker threads.

    for( int i=0; i<MAX_THREADS; i++ )
    {


        pDataArray[i] = (PMYDATA) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY,
                sizeof(MYDATA));

        if( pDataArray[i] == NULL )
        {

            ExitProcess(2);
        }



        pDataArray[i]->val1 = i;
        pDataArray[i]->val2 = i+100;

        // Create the thread to begin execution on its own.

        hThreadArray[i] = CreateThread( 
            NULL,                   // default security attributes
            0,                      // use default stack size  
            MyThreadFunction,       // thread function name
            pDataArray[i],          // argument to thread function 
            0,                      // use default creation flags 
            &dwThreadIdArray[i]);   // returns the thread identifier 




        if (hThreadArray[i] == NULL) 
        {
           ErrorHandler(TEXT("CreateThread"));
           ExitProcess(3);
        }
    } // End of main thread creation loop.

    // Wait until all threads have terminated.

    WaitForMultipleObjects(MAX_THREADS, hThreadArray, TRUE, INFINITE);



    for(int i=0; i<MAX_THREADS; i++)
    {
        CloseHandle(hThreadArray[i]);
        if(pDataArray[i] != NULL)
        {
            HeapFree(GetProcessHeap(), 0, pDataArray[i]);
            pDataArray[i] = NULL;    // Ensure address is not reused.
        }
    }

    tend = clock();
    proc_time_pure = tend - tstart;

    // Print the processing time to the console and append it to a file.
    // clock() returns ticks; divide by CLOCKS_PER_SEC (1000 on Windows) to get seconds.
    printf("   Processing time: %4.3f \n", proc_time_pure / CLOCKS_PER_SEC);
    ofs.open("Processing_time.txt", std::ofstream::out | std::ofstream::app);
    ofs << proc_time_pure / CLOCKS_PER_SEC << " ";
    ofs.close();
    }
    return 0;
}


DWORD WINAPI MyThreadFunction( LPVOID lpParam ) 
{ 
    HANDLE hStdout;
    PMYDATA pDataArray;




    hStdout = GetStdHandle(STD_OUTPUT_HANDLE);
    if( hStdout == INVALID_HANDLE_VALUE )
        return 1;



    pDataArray = (PMYDATA)lpParam;
    // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    // Critical function
    Eigendecomposition();
    // !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    return 0; 
} 



void ErrorHandler(LPTSTR lpszFunction) 
{ 
    // Retrieve the system error message for the last-error code.

    LPVOID lpMsgBuf;
    LPVOID lpDisplayBuf;
    DWORD dw = GetLastError(); 

    FormatMessage(
        FORMAT_MESSAGE_ALLOCATE_BUFFER | 
        FORMAT_MESSAGE_FROM_SYSTEM |
        FORMAT_MESSAGE_IGNORE_INSERTS,
        NULL,
        dw,
        MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT),
        (LPTSTR) &lpMsgBuf,
        0, NULL );

    // Display the error message.

    lpDisplayBuf = (LPVOID)LocalAlloc(LMEM_ZEROINIT, 
        (lstrlen((LPCTSTR) lpMsgBuf) + lstrlen((LPCTSTR) lpszFunction) + 40) * sizeof(TCHAR)); 
    StringCchPrintf((LPTSTR)lpDisplayBuf, 
        LocalSize(lpDisplayBuf) / sizeof(TCHAR),
        TEXT("%s failed with error %d: %s"), 
        lpszFunction, dw, lpMsgBuf); 
    MessageBox(NULL, (LPCTSTR) lpDisplayBuf, TEXT("Error"), MB_OK); 

    // Free error-handling buffer allocations.

    LocalFree(lpMsgBuf);
    LocalFree(lpDisplayBuf);
}

void Eigendecomposition(){
    const int M = 32;
    typedef MKL_Complex16  double_complex;
    const char    jobz = 'V';
    const char    uplo = 'L'; // lower triangular part of input matrix is used
    const MKL_INT dim = M;
    const MKL_INT ldz = M;
    const MKL_INT LWORK = (2*M-1);
    const MKL_INT LRWORK = (3*M-2);
    MKL_INT       info = 0;


    double_complex A_H_MKL[(M*M+M)/2];

    CACHE_ALIGN double_complex       work[LWORK]; 
    CACHE_ALIGN double               rwork[LRWORK];

    double D[M];
    double_complex U[M][M];
    for (int i = 0; i < 500; i++) {
        // Create the input matrix (packed lower triangle, all ones).
        for (int tmp = 0; tmp < (M*M+M)/2; tmp++) {
            A_H_MKL[tmp].real = 1;
            A_H_MKL[tmp].imag = 0;
        }

        // This is the MKL function.
        zhpev(&jobz,                            // const char* jobz,
              &uplo,                            // const char* uplo,
              &dim,                             // const MKL_INT* n,
              (double_complex *)&A_H_MKL[0],    // double_complex* ap,
              (double *)&D[0],                  // double* w,
              (double_complex *)&U[0][0],       // double_complex* z,
              &ldz,                             // const MKL_INT* ldz,
              work,                             // double_complex* work,
              rwork,                            // double* rwork,
              &info);                           // MKL_INT* info
    }
}
  • Resource leaks (memory or other) don't have to be in your code; it may actually be a bug in the library. – Some programmer dude Jul 18 '14 at 06:59
  • Thanks, Joachim. However, running the routine "zhpev" inside a small test function without any large environment does not provoke a run time increase. Therefore, I think the source of the problem is a combination of "zhpev" and some issue in my code. – The Pheromone Kid Jul 18 '14 at 07:07
  • Then have you thought about that the function might need to do more calculations for certain sets of arguments and/or data? Or that you might have more data in the slower calls? Or are you calling the function with the exact same arguments and with the same data sizes every time? – Some programmer dude Jul 18 '14 at 07:09
  • The input data is not the problem here. Calling the function with the same input causes the same behaviour as with varying input data. – The Pheromone Kid Jul 18 '14 at 07:25
  • You already mentioned that memory should not be the issue. But did you monitor how memory develops at run-time? (For Win I recommend the utility "Address Space Monitor".) Are you sure that the "large environment" doesn't generate data on the fly? – normanius Jul 18 '14 at 08:44
  • Other things that I just thought of: Are parts of your application multi-threaded? Or do you make heavy use of recursive function calls (=> you might run into some call stack limitation of your system) – normanius Jul 18 '14 at 08:52
  • I monitored the memory consumption. Around 500 MB out of 2 GB are used in total. The memory consumption increases within the first run time evaluations but then saturates at around 500 MB. – The Pheromone Kid Jul 18 '14 at 08:59
  • This is quite a big memory footprint. What about fragmentation? zhpev probably tries to allocate big chunks of data at a time. If there is no contiguous block of free memory space, paging might be turned on by the memory manager surprisingly early... – normanius Jul 18 '14 at 09:04
  • @normanius: The run time problems occur with and without multi-threading. Moreover, I do not make recursive function calls. – The Pheromone Kid Jul 18 '14 at 09:36
  • @normanius: How can I detect possible fragmentation and how can I avoid it? – The Pheromone Kid Jul 18 '14 at 09:38
  • Are you on Win? Try the Address Space Monitor Utility. It's simple, but I use it to get an idea how fragmented the memory is. You can get it from here: http://hashpling.org/asm/ – normanius Jul 18 '14 at 09:53
  • Thank you normanius, I will try that! – The Pheromone Kid Jul 18 '14 at 09:59

3 Answers


Because I'm lacking the details, I can only try to give a general answer. I already mentioned memory fragmentation in my comments as a possible cause of longer run-time cycles. Similarly, caching might turn into a bottleneck in the course of software execution. How do the other parts of your application operate? Do they also heavily process data?

Stack Overflow users have already discussed caching topics in many threads; read, for example, What is a "cache-friendly" code?. I think some of the posts are very informative and may help you to understand your problem.
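
Regarding fragmentation: if you cannot install a monitoring tool, you can get a rough picture of the default process heap directly with HeapWalk. This is only a diagnostic sketch (ReportHeapFragmentation is an illustrative name), and it covers only the default process heap, not necessarily every allocation MKL makes:

#include <windows.h>
#include <stdio.h>

// Walk the default process heap and report how many free blocks exist
// and how large the largest free block is (a rough fragmentation measure).
void ReportHeapFragmentation()
{
    HANDLE heap = GetProcessHeap();
    PROCESS_HEAP_ENTRY entry;
    entry.lpData = NULL;

    unsigned long freeBlocks = 0;
    unsigned long largestFree = 0;

    HeapLock(heap);
    while (HeapWalk(heap, &entry))
    {
        // Count only committed but unallocated blocks.
        if (!(entry.wFlags & (PROCESS_HEAP_ENTRY_BUSY |
                              PROCESS_HEAP_REGION |
                              PROCESS_HEAP_UNCOMMITTED_RANGE)))
        {
            ++freeBlocks;
            if (entry.cbData > largestFree)
                largestFree = entry.cbData;
        }
    }
    HeapUnlock(heap);

    printf("Free blocks: %lu, largest free block: %lu bytes\n",
           freeBlocks, largestFree);
}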

normanius
  • Thank you, normanius. Unfortunately, I cannot use the ASM tool you recommended, as I am not allowed to install anything on the computer. This is the policy of my company. – The Pheromone Kid Jul 22 '14 at 09:31
  • Wouldn't caching problems also affect other parts of the software execution? I have to note that all other parts of the software run smoothly without any difficulties. Also, the other parts of the software process large amounts of data. – The Pheromone Kid Jul 22 '14 at 10:45
  • It depends if the rest of your code is also cache optimized or not... Without knowing much about zhpev, LAPACK in general "was designed to effectively exploit the caches on modern cache-based architectures" (see Wiki on LAPACK). So the amount of traffic over the system cache might have an effect on the run-time characteristics of (some) LAPACK algorithms. It's just a guess. What you could do: 1) Compare on different systems with different amount of CPU cache. 2) Look at the cache traffic caused by your application. Cache miss rate would be an interesting measure. – normanius Jul 23 '14 at 08:46
  • Or this thread: http://stackoverflow.com/questions/10122520/profiling-cpu-cache-memory-from-the-os-application – normanius Jul 23 '14 at 08:48

As normanius already stated, it is hard to guess what the problem might be without actually seeing your code and the implementation of zhpev. However, besides memory fragmentation and caching issues, the problem might be related to the properties of your Hermitian matrices.

From what I have read about zhpev, there is no doubt that it is based on some iterative method of numerical analysis. Therefore, depending on the properties of the method used, the time (number of iterations) zhpev needs to converge may vary quite widely with the properties of the input matrix.

Another matrix-property reason might be that zhpev has several numerical methods at hand and, based on an analysis of the input matrix, chooses the one that can best exploit its properties to compute the result faster.

You state that the input data has no effect on the performance issue, but can you be sure about that? Don't your Hermitian matrices somehow evolve during the computation? Would it be possible to feed zhpev the same matrix every time, alongside your actual eigenvalue computation, and compare the resulting times? (Of course, you need to ensure that it isn't optimized out.)
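
As a minimal sketch of such a check, you could time a reference call with a fixed input next to your real computation, for example by reusing the Eigendecomposition() function from your example (it already fills the matrix with the same values on every call):

#include <windows.h>
#include <stdio.h>

void Eigendecomposition();   // the fixed-input routine from the question

// Time one reference decomposition block with the high-resolution counter.
double TimeReferenceDecomposition()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    Eigendecomposition();            // same 32x32 all-ones matrix every time
    QueryPerformanceCounter(&t1);

    double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
    printf("Reference zhpev block: %.6f s\n", seconds);
    return seconds;
}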

Jendas

I found the bug in my code. I run the routine for eigendecomposition inside a thread that was created with the Windows function CreateThread. However, there was no call to wait for the thread to finish, such as WaitForMultipleObjects. For all the other parts of my application this was not a problem, but the eigendecomposition ran into difficulties.
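
In other words, every CreateThread needs a matching wait before the handles are closed. A sketch of the fixed pattern, reusing the declarations from the example code in the question:

for (int i = 0; i < MAX_THREADS; i++)
{
    hThreadArray[i] = CreateThread(NULL, 0, MyThreadFunction,
                                   pDataArray[i], 0, &dwThreadIdArray[i]);
}

// This synchronization was missing in my original code:
// wait until all worker threads (and thus zhpev) have finished.
WaitForMultipleObjects(MAX_THREADS, hThreadArray, TRUE, INFINITE);

for (int i = 0; i < MAX_THREADS; i++)
    CloseHandle(hThreadArray[i]);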