Multithreaded program becomes nonresponsive on _multiple_ CPUs, but fine on a single CPU (when updating a ListView)

Question

Update: I've reproduced the problem! Scroll lower to see the code.

Quick Notes

My Core i5 CPU has 2 cores, hyperthreading.
If I call SetProcessAffinityMask(GetCurrentProcess(), 1), everything is fine, even though the program is still multithreaded.
If I don't do that, and the program is running on Windows XP (it's fine on Windows 7 x64!), my GUI starts locking up for several seconds while I'm scrolling the list view and the icons are loading.

The Problem

Basically, when I run the program posted below (a reduced version of my original code) on Windows XP (Windows 7 is fine), unless I force the same logical CPU for all my threads, the program UI starts lagging behind by half a second or so.

(Note: Lots of edits to this post here, as I investigated the problem further.)

Note that the number of threads is the same -- only the affinity mask is different.

I've tried this out using two different methods of message-passing: the built-in GetMessage as well as my own BackgroundWorker.

The result? BackgroundWorker benefits from affinity for 1 logical CPU (virtually no lag), whereas GetMessage is completely hurt by this, (lag is now many seconds long).

I can't figure out why that would be happening -- shouldn't multiple CPUs work better than a single CPU?!
Why would there be such a lag, when the number of threads is the same?

More stats:

GetLogicalProcessorInformation returns:

0x0: {ProcessorMask=0x0000000000000003 Relationship=RelationProcessorCore ...}
0x1: {ProcessorMask=0x0000000000000003 Relationship=RelationCache ...}
0x2: {ProcessorMask=0x0000000000000003 Relationship=RelationCache ...}
0x3: {ProcessorMask=0x0000000000000003 Relationship=RelationCache ...}
0x4: {ProcessorMask=0x000000000000000f Relationship=RelationProcessorPackage ...}
0x5: {ProcessorMask=0x000000000000000c Relationship=RelationProcessorCore ...}
0x6: {ProcessorMask=0x000000000000000c Relationship=RelationCache ...}
0x7: {ProcessorMask=0x000000000000000c Relationship=RelationCache ...}
0x8: {ProcessorMask=0x000000000000000c Relationship=RelationCache ...}
0x9: {ProcessorMask=0x000000000000000f Relationship=RelationCache ...}
0xa: {ProcessorMask=0x000000000000000f Relationship=RelationNumaNode ...}

The Code

The code below should shows this problem on Windows XP SP3. (At least, it does on my computer!)

Compare these two:

Run the program normally, then scroll. You should see lag.
Run the program with the affinity command-line argument, then scroll. It should be almost completely smooth.

Why would this happen?

#define _WIN32_WINNT 0x502

#include <tchar.h>
#include <Windows.h>
#include <CommCtrl.h>

#pragma comment(lib, "kernel32.lib")
#pragma comment(lib, "comctl32.lib")
#pragma comment(lib, "user32.lib")

LONGLONG startTick = 0;

LONGLONG QPC()
{ LARGE_INTEGER v; QueryPerformanceCounter(&v); return v.QuadPart; }

LONGLONG QPF()
{ LARGE_INTEGER v; QueryPerformanceFrequency(&v); return v.QuadPart; }

bool logging = false;
bool const useWindowMessaging = true;   // GetMessage() or BackgroundWorker?
bool const autoScroll = false;   // for testing

class BackgroundWorker
{
    struct Thunk
    {
        virtual void operator()() = 0;
        virtual ~Thunk() { }
    };
    class CSLock
    {
        CRITICAL_SECTION& cs;
    public:
        CSLock(CRITICAL_SECTION& criticalSection)
            : cs(criticalSection)
        { EnterCriticalSection(&this->cs); }
        ~CSLock() { LeaveCriticalSection(&this->cs); }
    };
    template<typename T>
    class ScopedPtr
    {
        T *p;
        ScopedPtr(ScopedPtr const &) { }
        ScopedPtr &operator =(ScopedPtr const &) { }
    public:
        ScopedPtr() : p(NULL) { }
        explicit ScopedPtr(T *p) : p(p) { }
        ~ScopedPtr() { delete p; }
        T *operator ->() { return p; }
        T &operator *() { return *p; }
        ScopedPtr &operator =(T *p)
        {
            if (this->p != NULL) { __debugbreak(); }
            this->p = p;
            return *this;
        }
        operator T *const &() { return this->p; }
    };

    Thunk **const todo;
    size_t nToDo;
    CRITICAL_SECTION criticalSection;
    DWORD tid;
    HANDLE hThread, hSemaphore;
    volatile bool stop;
    static size_t const MAX_TASKS = 1 << 18;  // big enough for testing

    static DWORD CALLBACK entry(void *arg)
    { return ((BackgroundWorker *)arg)->process(); }

public:
    BackgroundWorker()
        : nToDo(0), todo(new Thunk *[MAX_TASKS]), stop(false), tid(0),
        hSemaphore(CreateSemaphore(NULL, 0, 1 << 30, NULL)),
        hThread(CreateThread(NULL, 0, entry, this, CREATE_SUSPENDED, &tid))
    {
        InitializeCriticalSection(&this->criticalSection);
        ResumeThread(this->hThread);
    }

    ~BackgroundWorker()
    {
        // Clear all the tasks
        this->stop = true;
        this->clear();
        LONG prev;
        if (!ReleaseSemaphore(this->hSemaphore, 1, &prev) ||
            WaitForSingleObject(this->hThread, INFINITE) != WAIT_OBJECT_0)
        { __debugbreak(); }
        CloseHandle(this->hSemaphore);
        CloseHandle(this->hThread);
        DeleteCriticalSection(&this->criticalSection);
        delete [] this->todo;
    }

    void clear()
    {
        CSLock lock(this->criticalSection);
        while (this->nToDo > 0)
        {
            delete this->todo[--this->nToDo];
        }
    }

    unsigned int process()
    {
        DWORD result;
        while ((result = WaitForSingleObject(this->hSemaphore, INFINITE))
            == WAIT_OBJECT_0)
        {
            if (this->stop) { result = ERROR_CANCELLED; break; }
            ScopedPtr<Thunk> next;
            {
                CSLock lock(this->criticalSection);
                if (this->nToDo > 0)
                {
                    next = this->todo[--this->nToDo];
                    this->todo[this->nToDo] = NULL;  // for debugging
                }
            }
            if (next) { (*next)(); }
        }
        return result;
    }

    template<typename Func>
    void add(Func const &func)
    {
        CSLock lock(this->criticalSection);
        struct FThunk : public virtual Thunk
        {
            Func func;
            FThunk(Func const &func) : func(func) { }
            void operator()() { this->func(); }
        };
        DWORD exitCode;
        if (GetExitCodeThread(this->hThread, &exitCode) &&
            exitCode == STILL_ACTIVE)
        {
            if (this->nToDo >= MAX_TASKS) { __debugbreak(); /*too many*/ }
            if (this->todo[this->nToDo] != NULL) { __debugbreak(); }
            this->todo[this->nToDo++] = new FThunk(func);
            LONG prev;
            if (!ReleaseSemaphore(this->hSemaphore, 1, &prev))
            { __debugbreak(); }
        }
        else { __debugbreak(); }
    }
};

LRESULT CALLBACK MyWindowProc(
    HWND hWnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
{
    enum { IDC_LISTVIEW = 101 };
    switch (uMsg)
    {
        case WM_CREATE:
        {
            RECT rc; GetClientRect(hWnd, &rc);

            HWND const hWndListView = CreateWindowEx(
                WS_EX_CLIENTEDGE, WC_LISTVIEW, NULL,
                WS_CHILDWINDOW | WS_VISIBLE | LVS_REPORT |
                LVS_SHOWSELALWAYS | LVS_SINGLESEL | WS_TABSTOP,
                rc.left, rc.top, rc.right - rc.left, rc.bottom - rc.top,
                hWnd, (HMENU)IDC_LISTVIEW, NULL, NULL);

            int const cx = GetSystemMetrics(SM_CXSMICON),
                cy = GetSystemMetrics(SM_CYSMICON);

            HIMAGELIST const hImgList =
                ImageList_Create(
                    GetSystemMetrics(SM_CXSMICON),
                    GetSystemMetrics(SM_CYSMICON),
                    ILC_COLOR32, 1024, 1024);

            ImageList_AddIcon(hImgList, (HICON)LoadImage(
                NULL, IDI_INFORMATION, IMAGE_ICON, cx, cy, LR_SHARED));

            LVCOLUMN col = { LVCF_TEXT | LVCF_WIDTH, 0, 500, TEXT("Name") };
            ListView_InsertColumn(hWndListView, 0, &col);
            ListView_SetExtendedListViewStyle(hWndListView,
                LVS_EX_DOUBLEBUFFER | LVS_EX_FULLROWSELECT | LVS_EX_GRIDLINES);
            ListView_SetImageList(hWndListView, hImgList, LVSIL_SMALL);

            for (int i = 0; i < (1 << 11); i++)
            {
                TCHAR text[128]; _stprintf(text, _T("Item %d"), i);
                LVITEM item =
                {
                    LVIF_IMAGE | LVIF_TEXT, i, 0, 0, 0,
                    text, 0, I_IMAGECALLBACK
                };
                ListView_InsertItem(hWndListView, &item);
            }

            if (autoScroll)
            {
                SetTimer(hWnd, 0, 1, NULL);
            }

            break;
        }
        case WM_TIMER:
        {
            HWND const hWndListView = GetDlgItem(hWnd, IDC_LISTVIEW);
            RECT rc; GetClientRect(hWndListView, &rc);
            if (!ListView_Scroll(hWndListView, 0, rc.bottom - rc.top))
            {
                KillTimer(hWnd, 0);
            }
            break;
        }
        case WM_NULL:
        {
            HWND const hWndListView = GetDlgItem(hWnd, IDC_LISTVIEW);
            int const iItem = (int)lParam;
            if (logging)
            {
                _tprintf(_T("@%I64lld ms:")
                    _T(" Received: #%d\n"),
                    (QPC() - startTick) * 1000 / QPF(), iItem);
            }
            int const iImage = 0;
            LVITEM const item = {LVIF_IMAGE, iItem, 0, 0, 0, NULL, 0, iImage};
            ListView_SetItem(hWndListView, &item);
            ListView_Update(hWndListView, iItem);
            break;
        }
        case WM_NOTIFY:
        {
            LPNMHDR const pNMHDR = (LPNMHDR)lParam;
            switch (pNMHDR->code)
            {
            case LVN_GETDISPINFO:
                {
                    NMLVDISPINFO *const pInfo = (NMLVDISPINFO *)lParam;
                    struct Callback
                    {
                        HWND hWnd;
                        int iItem;
                        void operator()()
                        {
                            if (logging)
                            {
                                _tprintf(_T("@%I64lld ms: Sent:     #%d\n"),
                                    (QPC() - startTick) * 1000 / QPF(),
                                    iItem);
                            }
                            PostMessage(hWnd, WM_NULL, 0, iItem);
                        }
                    };
                    if (pInfo->item.iImage == I_IMAGECALLBACK)
                    {
                        if (useWindowMessaging)
                        {
                            DWORD const tid =
                                (DWORD)GetWindowLongPtr(hWnd, GWLP_USERDATA);
                            PostThreadMessage(
                                tid, WM_NULL, 0, pInfo->item.iItem);
                        }
                        else
                        {
                            Callback callback = { hWnd, pInfo->item.iItem };
                            if (logging)
                            {
                                _tprintf(_T("@%I64lld ms: Queued:   #%d\n"),
                                    (QPC() - startTick) * 1000 / QPF(),
                                    pInfo->item.iItem);
                            }
                            ((BackgroundWorker *)
                             GetWindowLongPtr(hWnd, GWLP_USERDATA))
                                ->add(callback);
                        }
                    }
                    break;
                }
            }
            break;
        }
        
        case WM_CLOSE:
        {
            PostQuitMessage(0);
            break;
        }
    }
    return DefWindowProc(hWnd, uMsg, wParam, lParam);
}

DWORD WINAPI BackgroundWorkerThread(LPVOID lpParameter)
{
    HWND const hWnd = (HWND)lpParameter;
    MSG msg;
    while (GetMessage(&msg, NULL, 0, 0) > 0 && msg.message != WM_QUIT)
    {
        if (msg.message == WM_NULL)
        {
            PostMessage(hWnd, msg.message, msg.wParam, msg.lParam);
        }
    }
    return 0;
}

int _tmain(int argc, LPTSTR argv[])
{
    startTick = QPC();
    bool const affinity = argc >= 2 && _tcsicmp(argv[1], _T("affinity")) == 0;
    if (affinity)
    { SetProcessAffinityMask(GetCurrentProcess(), 1 << 0); }

    bool const log = logging;  // disable temporarily
    logging = false;

    WNDCLASS wndClass =
    {
        0, &MyWindowProc, 0, 0, NULL, NULL, LoadCursor(NULL, IDC_ARROW),
        GetSysColorBrush(COLOR_3DFACE), NULL, TEXT("MyClass")
    };
    HWND const hWnd = CreateWindow(
        MAKEINTATOM(RegisterClass(&wndClass)),
        affinity ? TEXT("Window (1 CPU)") : TEXT("Window (All CPUs)"),
        WS_OVERLAPPEDWINDOW | WS_VISIBLE, CW_USEDEFAULT, CW_USEDEFAULT,
        CW_USEDEFAULT, CW_USEDEFAULT, NULL, NULL, NULL, NULL);

    BackgroundWorker iconLoader;
    DWORD tid = 0;
    if (useWindowMessaging)
    {
        CreateThread(NULL, 0, &BackgroundWorkerThread, (LPVOID)hWnd, 0, &tid);
        SetWindowLongPtr(hWnd, GWLP_USERDATA, tid);
    }
    else { SetWindowLongPtr(hWnd, GWLP_USERDATA, (LONG_PTR)&iconLoader); }
    
    MSG msg;
    while (GetMessage(&msg, NULL, 0, 0) > 0)
    {
        if (!IsDialogMessage(hWnd, &msg))
        {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }

        if (msg.message == WM_TIMER ||
            !PeekMessage(&msg, NULL, 0, 0, PM_NOREMOVE))
        { logging = log; }
    }

    PostThreadMessage(tid, WM_QUIT, 0, 0);
    return 0;
}

There is another question on SO about SHGetFileInfo in threads. I tried to post a link as an answer, but it seems to have vapourized.. — Martin James, Jul 04 '12 at 23:43
http://stackoverflow.com/questions/10105518/calling-shgetfileinfo-in-thread-to-avoid-ui-freeze — Windows programmer, Jul 05 '12 at 01:45
Microsoft says to use a background thread: http://msdn.microsoft.com/en-us/library/windows/desktop/bb762179(v=vs.85).aspx but Microsoft's example doesn't use a background thread: http://support.microsoft.com/kb/319350/en-us — Windows programmer, Jul 05 '12 at 01:49
Thanks, @Windowsprogrammer. I posted that link, there were no errors, and there it was, gone:( — Martin James, Jul 05 '12 at 03:39
Thanks for the link. Those are definitely interesting, but as far as I can tell they don't really address the issue here (multiple processors). — user541686, Jul 05 '12 at 03:49
Well, obviously you have to use a background thread and not use a background thread. No matter how many processors you have, you still have to meet both of these requirements. Now to be serious for a moment, since you know a workaround for XP, it looks like you should test if the OS is XP (probably Server 2003 too), and set your processor affinity in that case. Hmm, what happens on an i3 or i7? — Windows programmer, Jul 05 '12 at 04:26
@Windowsprogrammer: I don't have an i3 or i7. :( I don't think it should be a CPU issue (I've never had any problems with it) -- I only mentioned the CPU to say that it has 2 cores, hyperthreading. — user541686, Jul 05 '12 at 04:46
Well, you also mentioned the CPU issue because if you set processor affinity then the problem doesn't manifest itself even under Windows XP. So that's what you have to do, even though it's ugly and nonsensical. — Windows programmer, Jul 05 '12 at 04:50
@Windowsprogrammer: :( I'll go see if I can borrow someone's computer then... — user541686, Jul 05 '12 at 04:55
Borrowing someone else's computer will answer some curiosity about what happens with an i3 or i7. But you know what happens with an i5 and Windows XP, and you already found an ugly workaround, so you still have to do it. — Windows programmer, Jul 05 '12 at 06:45
Hyperthreading might have something to do with this. As far as I know, old OS like XP consider a single HT CPU as two cores. This can screw up scheduling, as two threads might be using the same CPU, making almost no gain from the HT while suffering from oversubscription. Mehrdad, can you turn Hyperthreading off (I think it's in the bios) and see if that makes a difference? AFAIK, new OS like Win7 do recognize HT and handle it better, which also corresponds to your finding. — Eran, Jul 08 '12 at 12:47
Also, from your [other question](http://stackoverflow.com/questions/11382085/why-does-this-trivial-win32-program-suddenly-freeze-when-i-press-the-button-and), I'm wondering if it's your XP that is malfunctioning. Try running your app after booting in safe mode, in order to have the OS as vacant as possible. — Eran, Jul 08 '12 at 13:15
@eran: The other question is Windows 7 actually. :P I also can't turn off HT, my BIOS doesn't have such an option. But I can certainly set the processor affinity to be on two different cores... hmm... — user541686, Jul 08 '12 at 17:40
@eran: I just tried setting the CPU affinity of the process (or even the individual threads) to all potential combinations I could think of, on the same and on different logical CPUs. The end result was the same -- as long as they're restricted to the same *single logical CPU*, the program is fast. But if they're on different logical CPUs (even if they're restricted to the same core), it locks up. So it doesn't seem like hyperthreading should be causing this. — user541686, Jul 08 '12 at 18:08
Ok, another 2 half-baked ideas: seems like `AddIcon` and `ReplaceIcon` are the ones that freeze your GUI. I don't know how the icon file is handled, but maybe having it opened by one CPU and read by another makes inefficient use of one of the caches. Try to measure the time it takes to `ReplaceIcon`, and then replace the same icon 10 times in a row. After the first, the cache should be up to date, so the next should be much faster. Also, just to be sure file are read directly, try running ProcMon during those stalls, and see if your app happens to look for the icons all over your PATH. — Eran, Jul 08 '12 at 19:36
@eran: Interesting... I'll see if I can give that a try. (Although the cache part would be a bit weird, since a 'stall' on the order of a minute is not merely a stall...) — user541686, Jul 08 '12 at 20:02
What?? That stall takes about a minute? I thought your scroll just wasn't smooth, not more. That's definitively not a caching issue then... I'd still give ProcMon a chance, though. Also, you do have `CImageList`'s source, and you can get a more precise location (up to an Win32 API call). Is the stack trace consistent during the stalls? What low-level function does the GUI thread hang on? — Eran, Jul 08 '12 at 20:13
@eran: Ooops, that was a typo! I meant one *second*, not one minute! >_< my bad... thanks for catching that! It's usually a large fraction of a second, occasionally a few seconds. Nowhere near a minute though! — user541686, Jul 08 '12 at 20:39
@eran: It was painful but I reproduced the problem and posted a piece of code that shows it, plus a copy of the executable. See my edit. :) — user541686, Jul 09 '12 at 02:03
I don't have a winXP box to test this on, but I would like to see some timestamps recorded at three locations in the sample: 1) at the moment that bw.add() is called, 2) at the moment that PostMessage(hWnd, WM_NULL...) is called, and 3) at the moment that the WM_NULL message is received. I'm trying to understand the distribution of latency in communcations between the two threads. Is it roughly equal, or are there larger latencies in one direction or the other? — Monroe Thomas, Jul 10 '12 at 06:29
@Mehrdad Awesome! Looks pretty clear that the UI window proc isn't servicing WM_NULL messages at nearly the rate that you are able to post them. Looks like there is a ton of LVN_GETDISPINFO messages being handled, which queues message to the background worker, which in turn posts pretty quickly back to the window proc message queue. Gotta think about this for bit. — Monroe Thomas, Jul 10 '12 at 17:11
@MonroeThomas: Yup. I'm suspecting an unknown thread scheduling problem on XP, since I can't think of any other (more likely) explanation... — user541686, Jul 10 '12 at 17:21
@Mehrdad I think that SendMessage may be your friend here; it will bypass the message queue so that the handling of WM_NULL messages is more evenly distributed. See my answer for more info. — Monroe Thomas, Jul 10 '12 at 18:17
When your program is stalled or stuttering -- WHAT IS IT DOING? If it's doing legitimate work, then that's fair. It's making forward progress as best it can. If it's doing something it's not supposed to be doing, then it has a bug you should fix. — David Schwartz, Jul 11 '12 at 06:35
@DavidSchwartz: Well, whatever it's doing (painting icons, I believe?), it's only stuttering when the threads are not on the same single CPU. What I'd like to know is why/how this is even possible, considering that the CPU affinity really shouldn't hurt performance, and that the number of CPUs is supposed to be more or less abstracted away by the system. — user541686, Jul 11 '12 at 06:41
@Mehrdad: If it has work to do, and is doing that work, you have no right to complain. If you didn't make it do the *right* work, that's your fault. If it's *not* doing any work, that's a system problem. If you wrote your code so that it does work that's not important while not doing work that is important, then your code is broken and you should fix it. (You cannot rely on the system to be "fair" because the system values performance over fairness. the system gets as much work done as possible.) — David Schwartz, Jul 11 '12 at 06:59
@DavidSchwartz: `for (int i = 0; i < 100; i++) { Spin(1000000); }` "does work" by spinning `i`, but I still have a right to complain about it if it suddenly appears inside my program. Same problem here -- I don't understand why it's happening (since the work doesn't seem to be *useful*, whatever it's doing), and I'd like to know why it happens in cases where I don't expect. — user541686, Jul 11 '12 at 07:28

Monroe Thomas · Answer 1 · 2012-07-12T17:03:45.927

5

Based on the inter-thread timings you posted at http://ideone.com/fa2fM, it looks like there is a fairness issue at play here. Based solely on this assumption, here is my reasoning as to the apparent cause of the perceived lag and a potential solution to the problem.

It looks like there is a large number of LVN_GETDISPINFO messages being generated and processed on one thread by the window proc, and while the background worker thread is able to keep up and post messages back to the window at the same rate, the WM_NULL messages it posts are so far back in the queue that it takes time before they get handled.

When you set the processor affinity mask, you introduce more fairness into the system because the same processor must service both threads, which will limit the rate at which LVN_GETDISPINFO messages are generated relative to the non-affinity case. This means that the window proc message queue is likely not as deep when you post your WM_NULL messages, which in turn means that they will be processed 'sooner'.

It seems that you need to somehow bypass the queueing effect. Using SendMessage, SendMessageCallback or SendNotifyMessage instead of PostMessage may be ways to do this. In the SendMessage case, your worker thread will block until the window proc thread is finished its current message and processes the sent WM_NULL message, but you will be able to inject your WM_NULL messages more evenly into the message processing flow. See this page for an explanation of queued vs. non-queued message handling.

If you choose to use SendMessage, but you don't want to limit the rate at which you can obtain icons due to the blocking nature of SendMessage, then you can use a third thread. Your I/O thread would post messages to the third thread, while the third thread uses SendMessage to inject icon updates into the UI thread. In this fashion, you have control of the queue of satisfied icon requests, instead of interleaving them into the window proc message queue.

As for the difference in behaviour between Win7 and WinXP, there may be a number of reasons why you don't appear to see this effect on Win7. It could be that the list view common control is implemented differently and limits the rate at which LVN_GETDISPINFO messages are generated. Or perhaps the thread scheduling mechanism in Win7 switches thread contexts more frequently or more fairly.

EDIT:

Based on your latest change, try the following:

...

                struct Callback 
                { 
                    HWND hWnd; 
                    int iItem; 
                    void operator()() 
                    { 
                        if (logging) 
                        { 
                            _tprintf(_T("@%I64lld ms: Sent:     #%d\n"), 
                                (QPC() - startTick) * 1000 / QPF(), 
                                iItem); 
                        } 
                        SendNotifyMessage(hWnd, WM_NULL, 0, iItem); // <----
                    } 
                }; 


...

DWORD WINAPI BackgroundWorkerThread(LPVOID lpParameter) 
{ 
    HWND const hWnd = (HWND)lpParameter; 
    MSG msg; 
    while (GetMessage(&msg, NULL, 0, 0) > 0 && msg.message != WM_QUIT) 
    { 
        if (msg.message == WM_NULL) 
        { 
            SendNotifyMessage(hWnd, msg.message, msg.wParam, msg.lParam); // <----
        } 
    } 
    return 0; 
}

EDIT 2:

After establishing that the LVN_GETDISPINFO message are being placed in the queue using SendMessage instead of PostMessage, we can't use SendMessage ourselves to bypass them.

Still proceeding on the assumption that there is a glut of messages being processed by the wndproc before the icon results are being sent back from the worker thread, we need another way to get those updates handled as soon as they are ready.

Here's the idea:

Worker thread places results in a synchronized queue-like data structure, and then posts (using PostMessage) a WM_NULL message to the wndproc (to ensure that the wndproc gets executed sometime in the future).
At the top of the wndproc (before the case statements), the UI thread checks the synchronized queue-like data structure to see if there are any results, and if so, removes one or more results from the queue-like data structure and processes them.

edited Jul 12 '12 at 17:03

answered Jul 10 '12 at 17:34

Monroe Thomas

4,962
1
17
21

Why exactly is it a 'fairness' issue? Aren't the threads running in parallel? What exactly is happening unfairly? – user541686 Jul 10 '12 at 20:19
@Mehrdad When running on multiple processors, the UI thread with the list view control is able to saturate the message queue with LVN_GETDISPINFO messages. There is such a large number of these initially that any WM_NULL messages that are posted from the second thread are way far back in the queue. At least, this is the conclusion I draw from the timings you posted. Look at all the `bw.Add` and `PostMessage WM_NULL` events there are before the first `case WM_NULL` event, and then notice how later on (at about 137ms) then a great number of the WM_NULL messages are handled at once. – Monroe Thomas Jul 10 '12 at 20:41
@Mehrdad To clarify (I hope!), I mean 'fair' here in the sense that the second thread is putting messages at the end of very long queue, instead of being able to inject messages to the front of the UI message queue when they are ready. When running multi-core, my theory is that the list view control is able to inject far more messages into the queue before the first WM_NULL message is processed than in the single core case, where the UI thread has to share time with the background worker. – Monroe Thomas Jul 10 '12 at 20:49
@Mehrdad So using one of the `SendMessage` variants will allow you to deliver the WM_NULL message to the window procedure when the result is ready, instead of piling it behind who knows what else messages are already in the queue. – Monroe Thomas Jul 10 '12 at 20:54
@Mehrdad An easy test is to regenerate the timings in the single core case and see if the `case WM_NULL` events are more evenly distributed. – Monroe Thomas Jul 10 '12 at 20:56
You keep on mentioning multiple *cores*, but in fact the problem is with *multiple logical CPUs*, as I mentioned many times. Two threads on the *same* core *still* cause this problem, and yet, the UI thread is still 'sharing' time with the background worker. (There's only so many messages you can enqueue in 1 time slice.) So, to me, it looks like, according to your explanation, the number of *cores* should matter (which is what you mentioned several times), *not* the number of logical CPU's. How does your answer take account of that fact? – user541686 Jul 11 '12 at 01:55
@Mehrdad Sorry for the confusion. When I say core I mean logical CPU. An HT processor can run two threads simultaneously in some circumstances. The UI thread may be happily spinning away on one logical CPU, filling up the message queue with LVN_GETDISPINFO messages while the bg worker is on the other thread running on the other logical CPU. When you set the thread affinity for one logical CPU, my theory is that the rate of LVN_GETDISPINFO messages gets cut down because the bg worker thread must also be executed by the same logical CPU. Does this explanation help? – Monroe Thomas Jul 11 '12 at 02:35
@Mehrdad The issue has nothing to do with thread scheduling. You can see from your timings that the distribution of `bw.Add` and `PostMessage WM_NULL` are very nearly equal , and that the bw worker is responding very quickly to every message sent to it. The only thing left that explains the distribution of WM_NULL messages being handled by the UI thread is that they are being queued and being significantly delayed before being processed because there are so many other messages on the queue. – Monroe Thomas Jul 11 '12 at 02:42
@Mehrdad Finally ( these are long comments! :) ), its the number of threads that can be run simultaneously that matters (which correlates with the number of logical CPUs). Because running on its own logical CPU allows the UI thread more time to fill up its message queue before the background worker starts posting to it. – Monroe Thomas Jul 11 '12 at 02:53
I'm sorry, I don't understand your reasoning when you say, *"It's the number of threads that can be run simultaneously that matters (which correlates with the number of logical CPUs). Because this allows the UI thread more time to fill up its message queue faster."* I don't see how you draw that conclusion. Yes, more parallel threads allows the UI thread's message queue to fill up quicker, but we only have 1 I/O thread here. So if they run on separate CPUs, they **both** get twice as much time to run, so they should keep up in both cases -- why would that change anything? – user541686 Jul 11 '12 at 03:02
Also, it's worth noting: while your answer is nice, it seems to me that your answer is basically just summarizing the log I posted ("message queue fills up too fast", basically), but it's not explaining anything about what you think is *causing* it. Can you predict what would happen if I changed the code around a bit? Say, what if I changed `ListView_Update` to send a different message to the list view? Would it still necessarily freeze? Or what if I used `GetQueueStatus` to check to see if the message queue is ever full? Would I ever see that happening? – user541686 Jul 11 '12 at 03:11
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/13704/discussion-between-monroe-thomas-and-mehrdad) – Monroe Thomas Jul 11 '12 at 04:06
I posted an update, maybe take a look. Seems like using `GetMessage` instead of `WaitForSingleObject` makes things *worse* with CPU affinity, but doesn't change at all without CPU affinity. – user541686 Jul 11 '12 at 06:19
@Mehrdad Hey, so the change is using `PostThreadMessage` instead of using a sempahore to synchronize from UI thread to worker? I don't expect to see very much change in behaviour at all with this change, because there never was a problem with the background worker getting messages, as per the timing log you posted. – Monroe Thomas Jul 11 '12 at 15:33
Hey, yup -- if you set `useWindowMessaging` to true, it uses `PostThreadMessage`; if you set it to false, it instead uses `BackgroundWorker` which uses a semaphore. It didn't *improve* the results to use `PostThreadMessage` -- it simply made them *much slower* with the single-CPU case (the exact opposite of the semaphore). Yeah, it surprises me as well; I don't have a good explanation. – user541686 Jul 11 '12 at 15:38
Because you are still *posting* WM_NULL messages from the background worker to the UI thread, you will continue to suffer the same problem. Think of what it is that you have control over, and what you don't. You can't control what the list view component is stuffing into the window message queue, so every solution that involves posting to the same queue will suffer the same problem. You *can* control how you inject WM_NULL messages into the message flow. Bypass the queue by using SendMessage in your callback (which won't block very long BTW, given the behaviour) or using SendNotifyMessage. – Monroe Thomas Jul 11 '12 at 15:41
@Mehrdad I posted what I think will be the solution. Same non-blocking behaviour as PostMessage, but will bypass the window message queue so that the WM_NULL messages will be processed by the window proc very shortly after they are produced. – Monroe Thomas Jul 11 '12 at 15:52
@Mehrdad Doesn't surprise me that using PostThreadMessage is slower than the semaphore for signalling the background worker. Way more machinery involved there. :) – Monroe Thomas Jul 11 '12 at 16:15
I don't see how `SendNotifyMessage` could "bypass the message queue"; that's a little nonsensical. What happens if there are ten threads using it? Do they *all* go first and bypass each other?? Anyway, I tried it just for your sake and it didn't make a single bit of difference, so I'm not really buying your theory unfortunately. :P (`SendMessage`, on the other hand, *does* work correctly -- except that it's pretty obvious and useless, because now the program is essentially single-threaded.) – user541686 Jul 12 '12 at 16:03
@Mehrdad I'm glad you at least tried it. `SendNotifyMessage` does indeed bypass the standard window message queue. See the section on non-queued messages at http://msdn.microsoft.com/en-us/library/windows/desktop/ms644927%28v=vs.85%29.aspx#routing. There is likely another higher priority queue that the `SendMessage` family gets to use, different from the queue that `PostMessage` uses. But I'm just speculating. – Monroe Thomas Jul 12 '12 at 16:06
Haha thanks, I think you'll still get (at least part of) the bounty since you helped me narrow down the problem, unless someone comes and gives a better answer. :) But I'm curious about what you're mentioning -- where exactly does it say `SendNotifyMessage` bypasses the queue? Can you paste the quote here please? :) – user541686 Jul 12 '12 at 16:07
@Mehrdad Here is a quote from the link in my previous comment: *Nonqueued messages are sent immediately to the destination window procedure, bypassing the system message queue and thread message queue.* It goes on further to say *Some functions that send nonqueued messages are BroadcastSystemMessage, BroadcastSystemMessageEx, SendMessage, SendMessageTimeout, and SendNotifyMessage.* – Monroe Thomas Jul 12 '12 at 16:09
I think this is a case where you're taking the docs for face value, without realizing what it's actually telling you. :) They're referring to (I'm 99.999% sure, it makes the most sense this way) *the cases where both the sender and the receiver are on the same thread* (i.e. where the call is just equivalent `SendMessage`), *not* in every case. :) So basically, `SendNotifyMessage` is not something you actually *need* -- you can write it yourself in terms of `SendMessage`, `PostMessage`, `GetCurrentThreadId`, and possibly `GetWindowThreadId`. It's just handy, that's all. – user541686 Jul 12 '12 at 16:11
@Mehrdad I don't think the use of `SendMessage` will result in a "single-threaded" application. The call to `SendMessage` will block the IO thread only briefly... the window proc will finish whatever message it is currently processing, and then will immediately handle your call from `SendMessage` before it processes any other messages, and then will return. Using your timing log you should be able to see nice interleaved behaviour. – Monroe Thomas Jul 12 '12 at 16:13
The I/O thread would be blocked by the window proc, in order to be able to get its return value. So unless it gets a reply sooner (hmm, `ReplyMessage` maybe? that's an interesting idea), the app will effectively be single-threaded, since everything is blocking on everything else. But even if you *do* use `ReplyMessage`, then that puts merely *one* extra message in the pipeline -- you'd still have to block on the window proc to finish handling the previous message you posted. So really, it's single-threaded either way. :) – user541686 Jul 12 '12 at 16:15
@Mehrdad No, I'm quite sure I'm correct in my interpretation. I've written code that depends on it. You can test it easily enough. – Monroe Thomas Jul 12 '12 at 16:16
Can you post an example? It doesn't make sense to me how a hundred different threads could all send "nonqueued" messages to the same thread. *Someone* has to go first! – user541686 Jul 12 '12 at 16:24
As I said in my earlier message, I am sure there are additional queues being used, but I don't know for sure how MS implements this functionality. I have looked at the WINE sources in the past, here is a comment from http://ftp.winehq.org/pub/wine/docs/en/winedev-guide.html#AEN3687: *However, in Win32 there could be several messages pending sent by preemptively executing threads, and in this case SendMessage has to build some sort of message queue for sent messages.* I'll see if I can find the WINE source for SendMessage. – Monroe Thomas Jul 12 '12 at 16:24
Are you talking about `SendMessage` or `SendNotifyMessage`? Your comment is now talking about the former... again, if you've depended on it before then an example that demonstrates the difference would be a lot more convincing than WINE. :) (Note that WINE need not be correct.) Also, if you take a look at the ReactOS code (`win32k/ntuser/message.c`), you'll notice the *very clear comment* under `UserSendNotifyMessage`, which says: *Basicly the same as IntPostOrSendMessage*, which in turn says that it either calls `PostMessage` or `SendMessage`. :P – user541686 Jul 12 '12 at 16:26
@Mehrdad After browsing the source at http://source.winehq.org/git/wine.git/blob/HEAD:/dlls/user32/message.c, in particular, the `peek_message` function, it looks like on every iteration of the message pump, all 'sent' messages are extracted from the queue and sent to the window proc in order before any 'posted' messages are processed. So effectively *bypassing* all of the posted messages. – Monroe Thomas Jul 12 '12 at 16:34
Sure, but 2 queues are still 2 queues. There's nothing "nonqueued" about them. :) (Again, I think WINE may not be correct here -- I trust ReactOS's comments more. But either way, that's trumped by the point I made above.) – user541686 Jul 12 '12 at 16:37
@Mehrdad You're getting pretty nitpicky here. All of the `SendMessage` family work the same way, except for how the calling thread waits for the result. In `SendMessage`, the calling thread blocks until the message has been executed by the target wndproc. In `SendNotifyMessage` and `SendMessageCallback` the calling thread returns immediately. However, the message they sent still effectively goes to the head of the queue. I'm not claiming WINE is correct, I was just pointing out one way that SendMessage could be implemented. They have a vested interest in ensuring correct behaviour. – Monroe Thomas Jul 12 '12 at 16:39
I'm not being nitpicky at all; I think you're just not realizing the truth: `SendNotifyMessage` *does not return* until the message has been received. Which means *you (and everyone else who used `SendNotifyMessage` before you) are blocking yourself* when you're calling `SendNotifyMessage` faster than the window proc can handle your messages. *By definition, your messages are queued.* – user541686 Jul 12 '12 at 16:41
@Mehrdad The point is not how its implemented under the hood, but how it behaves to the specification of the system. By using SendMessage, you can effectively bypass all of the messages posted using PostMessage, which has been my point all along as a way to work around the queueing behviour you have. Good luck! – Monroe Thomas Jul 12 '12 at 16:43
@Mehrdad The calling thread of `SendNotifyMessage` returns before the message is executed on the wndproc thread. Yes, the messages are queued, but they are executed, and therefore *bypass* all of the other messages queued via `PostMessage`. – Monroe Thomas Jul 12 '12 at 16:45
"Bypassing all the messages posted using `PostMessage`" is **meaningless** if there's *no* message posted by `PostMessage` in the first place!! I don't understand what you're arguing for here. Now you're just using Queue #2 instead of Queue #1, which is kind of silly! A queue is a queue (is a queue). – user541686 Jul 12 '12 at 16:45
@Mehrdad I believe all of the LVN_GESTDISPINFO messages have been queued using PostMessage... and thats what you want to bypass. – Monroe Thomas Jul 12 '12 at 16:47
I had no idea that you were trying to bypass `LVN_GESTDISPINFO`! But what tells you the system uses `PostMessage` for `LVN_GESTDISPINFO`?!! – user541686 Jul 12 '12 at 16:48
@Mehrdad, if LVN_GETDISPINFO message are sent via SendMessage instead of PostMessage, then you are correct. I was assuming they were placed in the queue via `PostMessage`, but I have no evidence to support this. I have a few minutes left that I can try to find out. – Monroe Thomas Jul 12 '12 at 16:50
@Mehrdad WINE source at http://source.winehq.org/git/wine.git/blob/HEAD:/dlls/comctl32/listview.c suggests WM_NOTIFY is sent via SendMessage. You could verify this by calling `InSendMessage` while handling the WM_NOTIFY. My bad for assuming the use of PostMessage. We can still bypass the message queue tho'. I'll update my answer. Thanks! – Monroe Thomas Jul 12 '12 at 16:56
You should seriously stop looking at WINE for everything ;) it's unnecessary and error-prone. Instead see Windows itself: http://ideone.com/X6wjf (It *couldn't* have been sent with `PostMessage`, because painting *needs* the information immediately!) And no, you're not bypassing "the message queue", you're bypassing *one of the 2 messages queues*. That means the entire difference between a multi-threaded and an almost-single-threaded app. But anyway, nice discussion, thanks. – user541686 Jul 12 '12 at 16:57
@Mehrdad My pleasure. Was interesting. Good point about the painting. – Monroe Thomas Jul 12 '12 at 17:05

score 2 · Answer 2 · answered Jul 14 '12 at 23:49

2

The issue has less to do with thread affinity and more to do with telling the listview that it needs to update the list item every time you update it. Because you do not add the LVIF_DI_SETITEM flag to pInfo->item.mask in your LVN_GETDISPINFO handler, and because you call ListView_Update manually, when you call ListView_Update, the list view invalidates any item that still has its iImage set to I_IMAGECALLBACK.

You can fix this in one of two ways (or a combination of both):

Remove ListView_Update from your WM_NULL handler. The list view will automatically redraw the items you set the image for in your WM_NULL handler when you set them, and it will not attempt to redraw items you haven't set the image for more than once.
Set LVIF_DI_SETITEM flag in pInfo->item.mask in your LVN_GETDISPINFO handler and set pInfo->item.iImage to a value that is not I_IMAGECALLBACK.

I repro'd similar awful behavior doing a full page scroll on Vista. Doing either of the above fixed the issue while still updating the icons asynchronously.

answered Jul 14 '12 at 23:49

MSN

53,214
7
75
105

Yes... although the goal of the question was understanding what could be possibly causing the CPU-related issue, *not* figuring out how to avoid the issue (I already had a few workarounds, e.g. setting the thread affinity, or using `ListView_RedrawItems` instead). – user541686 Jul 15 '12 at 08:19
@Mehrdad, if you look at the logging output doing a full page scroll (i.e., hit page down), you are queueing up O(N^2) `Callback`s. It's not a CPU related issue other than you end up dispatching multiple callbacks for a single item. – MSN Jul 15 '12 at 18:47
Why does it work fine when I set the CPU affinity to 1 logical CPU? – user541686 Jul 15 '12 at 18:57
@Mehrdad, when I set the affinity to 1 on my laptop, this issue still reproduces when I do a page-down. In general, as long as you set the item's image before the listview sends another LVN_GETDISPINFO, you'll be fine. Otherwise, without additional code, the list view will keep asking for the image, and you'll keep queueing up new callbacks, even though the first callback you queued up hasn't finished yet. – MSN Jul 15 '12 at 18:59

score 1 · Answer 3 · answered Jul 15 '12 at 01:27

1

Its plausible to suggest that this is related to XPs hyper threading/logical core scheduling and I will second IvoTops suggestion to try this with hyper-threading disabled. Please try this and let us know.

Why? Because:

a) Logical cores offer bad parallelism for CPU bound tasks. Running multiple CPU bound threads on two logical HT cores on the same physical core is detrimental to performance. See for example, this intel paper - it explains how enabling HT might cause typical server threads to incur an increase in latency or processing time for each request (while improving net throughput.)

b) Windows 7 does indeed have some HT/SMT (symmetrical multi threading) scheduling improvements. Mark Russinovich's slides here mention this briefly. Although they claim that XP scheduler is SMT aware, the fact that Windows 7 explicitly fixes something around this, implies there could be something lacking in XP. So I'm guessing that the OS isn't setting the thread affinity to the second core appropriately. (perhaps because the second core might not be idle at the instant of scheduling your second thread, to speculate wildly).
You wrote "I just tried setting the CPU affinity of the process (or even the individual threads) to all potential combinations I could think of, on the same and on different logical CPUs".

Can we try to verify that the execution is actually on the second core, once you set this?

You can visually check this in task manager or perfmon/perf counters

Maybe post the code where you set the affinity of the threads (I note that you are not checking the return value on SetProcessorAffinity, do check that as well.)

If Windows perf counters dont help, Intel's VTune Performance Analyzer is helpful for exactly this kind of stuff.

I think you can force the thread affinity manually using task manager.

One more thing: Your core i5 is either Nehalem or SandyBridge micro-architecture. Nehalem and later HT implementation is significantly different from the prior generation architectures (Core,etc). In fact Microsoft recommended disabling HT for running Biztalk server on pre-Nehalem systems. So perhaps Windows XP does not handle the new HT architecture well.

answered Jul 15 '12 at 01:27

*"I will second IvoTops suggestion to try this with hyper-threading disabled."*... did you read my comments on IvoTops's answer? – user541686 Jul 15 '12 at 01:32
@Mehrdad I did. Sorry if I have not understood your comments correctly, but it remains that you have not disabled hyperthreading. The Numproc switch limits the number of logical cores to use. It doesnt look like it allows you to restrict to one logical core from each physical core. Ergo, I was asking for more data to check if setThreadAffinity was doing the right thing (assuming you called it with the logical cores ids in separate physical cores) when you tried the different combinations mentioned above. – Jul 15 '12 at 09:24
1

@Mehrdad Right. I tested this on an older core2 duo XP box with no hyperthreading support on the CPU, and can see the issue. You can rule out HT as a cause. – Jul 15 '12 at 19:17
Ah... I had mentioned that I *can't* disable hyperthreading, which is what prompted my comment above. Yup, I've looked and `SetProcessAffinityMask` looked 100% consistent with my task manager, so I'm pretty sure it's working fine. +1 Thanks for testing it yourself, glad to know it's not an issue with my computer! – user541686 Jul 15 '12 at 19:20
@Mehrdad It appears that the system just sends the app more messages when you scroll the window down on two cores. This can be demonstrated by the following experiment: turn on logging, and using the background thread, run the exe and redirect the output to a file. Drag the scroll bar down 5 lines. Now count the "Received" messages in the log output. I find a 50% increase in messages when both cores are being used, compared to when CPU affinity is set. – Jul 15 '12 at 20:24
@Mehrdad By background thread, I meant: useWindowMessaging=false (This data and hypothesis can be explained by combining Monroes "fair scheduling" argument as well as MSN's "multiple messages" argument.) – Jul 15 '12 at 20:34
That's what's a good observation of *what's* happening, but *why*? I don't understand how CPU affinity could cause more messages to be sent... – user541686 Jul 15 '12 at 20:52
@Mehrdad The finding was that binding to a single logical core/CPU affinity is causing *less* messages to be sent, which is intuitive, I would think. – Jul 15 '12 at 21:05
(1) Sure, but on XP only?! (2) Why doesn't it work when you use the `PostMessage` message method? (3) Isn't the difference a bit *drastic*? (I'd expect a ~50% reduction or perhaps somewhere similar, but nothing vastly different like I see here.) – user541686 Jul 15 '12 at 21:10

score 0 · Answer 4 · answered Jul 08 '12 at 18:09

0

This might be a hyperthreading bug. To check if that's what causing it run your faulty program with Hyperthreading turned off (in the bios you can usually switch it off). I have run into two issues in the last five years that only surfaced when hyperthreading was enabled.

answered Jul 08 '12 at 18:09

IvoTops

3,463
17
18

Er, looks like you missed [my comment](http://stackoverflow.com/questions/11336123/multithreaded-program-becomes-nonresponsive-on-multiple-cpus-but-fine-on-a-si/11385433#comment15006856_11336123)? I can't turn off HT in the BIOS (no such option), but the affinity settings seem to indicate that's not the problem. – user541686 Jul 08 '12 at 18:13
I read your comment. You said the problem was the same when using two logical processors in one core. That's exactly what hyperthreading is, right? So it provides more evidence that hyperthreading is the problem. As another experiment, try to put each thread on a completely separate core, though it sounds like you might not be able to do that. – Windows programmer Jul 08 '12 at 23:36
@Windowsprogrammer: Er, I already tried putting them on separate cores too... I said "*even if* they're restricted to the same core*" (not that I always restricted them to the same core -- I tried both ways). That's also why I said I tested *"all potential combinations I could think of"*. Anyway, I posted a piece of code that reproduces it... maybe take a look at that? I'm convinced it's not HT, but it might be something specific to my laptop, I don't know... – user541686 Jul 09 '12 at 02:17
You can check whether it's hyperthreading multicore stuff by disabling all cores except the first. This can be done from xp boot.ini, if the bios does not allow it. See /ONECPU in http://www.pcreview.co.uk/forums/technical-information-onecpu-option-boot-ini-t3570761.html I know you said you think that's not it, but to me your issue has hyperthreading written all over it. – IvoTops Jul 09 '12 at 07:58
@IvoTops: You forgot to `@` me so I didn't get a notification. I don't understand what your experiment is supposed to show, but the result was **exactly** the same as with my `SetProcessAffinityMask` diagnosis saying 1 core versus 2 or more -- `/NumProc=1` was fine, `/NumProc=2` wasn't. Either way, it doesn't look like an HT issue in any way whatsoever -- it looks like some sort of OS scheduling bug to me, but I can't tell what. – user541686 Jul 10 '12 at 07:38