Two-threaded app is slower than single-threaded on C++ (VC++ 2010 Express). How to solve?

Question

I have some program that allocate memory a lot, I hoped to boost it's speed by splitting task on threads, but it made my program only slower.

I made this minimal example that has nothing to do with my real code aside of the fact it allocate memory in different threads.

class ThreadStartInfo
{
public:
    unsigned char *arr_of_5m_elems;
    bool TaskDoneFlag;

    ThreadStartInfo()
    {
        this->TaskDoneFlag = false;
        this->arr_of_5m_elems = NULL;
    }

    ~ThreadStartInfo()
    {
        if (this->arr_of_5m_elems)
            free(this->arr_of_5m_elems);
    }
};

unsigned long __stdcall CalcSomething(void *tsi_ptr)
{
    ThreadStartInfo *tsi = (ThreadStartInfo*)tsi_ptr;

    for (int i = 0; i < 5000000; i++)
    {
        double *test_ptr = (double*)malloc(tsi->arr_of_5m_elems[i] * sizeof(double));
        memset(test_ptr, 0, tsi->arr_of_5m_elems[i] * sizeof(double));
        free(test_ptr);
    }

    tsi->TaskDoneFlag = true;
    return 0;
}

void main()
{
    ThreadStartInfo *tsi1 = new ThreadStartInfo();
    tsi1->arr_of_5m_elems = (unsigned char*)malloc(5000000 * sizeof(unsigned char));
    ThreadStartInfo *tsi2 = new ThreadStartInfo();
    tsi2->arr_of_5m_elems = (unsigned char*)malloc(5000000 * sizeof(unsigned char));
    ThreadStartInfo **tsi_arr = (ThreadStartInfo**)malloc(2 * sizeof(ThreadStartInfo*));
    tsi_arr[0] = tsi1;
    tsi_arr[1] = tsi2;

    time_t start_dt = time(NULL);
    CalcSomething(tsi1);
    CalcSomething(tsi2);
    printf("Task done in %i seconds.\n", time(NULL) - start_dt);
    //--

    tsi1->TaskDoneFlag = false;
    tsi2->TaskDoneFlag = false;
    //--

    start_dt = time(NULL);
    unsigned long th1_id = 0;
    void *th1h = CreateThread(NULL, 0, CalcSomething, tsi1, 0, &th1_id);
    unsigned long th2_id = 0;
    void *th2h = CreateThread(NULL, 0, CalcSomething, tsi2, 0, &th2_id);

    retry:
    for (int i = 0; i < 2; i++)
        if (!tsi_arr[i]->TaskDoneFlag)
        {
            Sleep(100);
            goto retry;
        }

    CloseHandle(th1h);
    CloseHandle(th2h);

    printf("MT Task done in %i seconds.\n", time(NULL) - start_dt);
}

It prints me such results:

Task done in 16 seconds.
MT Task done in 19 seconds.

And... I didn't expected slow down. Is there anyway to make memory allocations faster in multiple threads?

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/204819/discussion-on-question-by-kosmos-two-threaded-app-is-slower-than-single-threaded). — Samuel Liew, Dec 25 '19 at 00:51

rustyx · Accepted Answer · 2019-12-24T09:55:31.147

2

Apart from some undefined behavior due to lack of synchronization on TaskDoneFlag, all the threads are doing is calling malloc/free repeatedly.

The Visual C++ CRT heap is single-threaded¹, as malloc/free delegate to HeapAlloc/HeapFree which execute in a critical section (only one thread at a time). Calling them from more than one thread at a time will never be faster than a single thread, and often slower due to the lock contention overhead.

Either reduce allocations in threads or switch to another memory allocator, like jemalloc or tcmalloc.

¹ See this note for HeapAlloc:

Serialization ensures mutual exclusion when two or more threads attempt to simultaneously allocate or free blocks from the same heap. There is a small performance cost to serialization, but it must be used whenever multiple threads allocate and free memory from the same heap. Setting the HEAP_NO_SERIALIZE value eliminates mutual exclusion on the heap. Without serialization, two or more threads that use the same heap handle might attempt to allocate or free memory simultaneously, likely causing corruption in the heap.

edited Dec 24 '19 at 09:55

answered Dec 24 '19 at 09:49

rustyx

80,671
25
200
267

*undefined behavior due to lack of synchronization on TaskDoneFlag* - i not think so. use `TaskDoneFlag` such as in example very not best way (good way use say interlocked decrement thread in task count and when it reach zero - set event, main thread wait on this event, instead sleep in loop). but i think not need any synchronization on `TaskDoneFlag` anyway. compiler can not drop `tsi->TaskDoneFlag = true;` line (or execute it more early). like and can not reorder or drop read from `tsi_arr[i]->TaskDoneFlag` (between unknow for him `Sleep`) and `TaskDoneFlag` atomic by self - true or false. – RbMm Dec 24 '19 at 11:30
1

A bool is not atomic, [not even on x86](https://stackoverflow.com/questions/14624776/can-a-bool-read-write-operation-be-not-atomic-on-x86) (otherwise there wouldn't be a need for a LOCK instruction prefix). Accessing the same variable without synchronization is UB. However, given the strong x86 memory model and the VC++10 compiler, the end result will probably be as expected. This doesn't mean the code is portable to other platforms (but yes, that probably doesn't matter in this case). – rustyx Dec 24 '19 at 13:00
bool is atomic in sense that only 2 values can be here - 0 (false) or not 0 (true) because this it and atomic - we can view only false(0) or true (any not 0) we can not view some "partial" modification – RbMm Dec 24 '19 at 13:10
and answer on your link not quit correct. even from formal c++ view - *bool* have only 2 stages, we can not read something wich not true and not false here. and lock need only in case when we do RMW operation . but in case we only write or read from bool - lock not need. and memory model here unrelated at all (in concrete code example) – RbMm Dec 24 '19 at 13:16
and platform (how bool implemented, how many bytes it hold) here not related - *bool - type, capable of holding one of the two values: true or false.* - this is main. so with or without any "syncronization" we read from *bool* true or false. memory model - is about order of access different memory locations, it visble and modification order. but here only single memory location. code is not optimal. but it not UB, and not platform depended – RbMm Dec 24 '19 at 13:26
so i be say - even if hardware impementation write and read to *bool* storage not atomic , the compiler by self interpret readed value only as true or as false. and if one thread write true to it (even with several hardware stores) - another thread in loop - at some time read true from it (even if several loads used - final result is interpreted as true or false) – RbMm Dec 24 '19 at 13:33

Two-threaded app is slower than single-threaded on C++ (VC++ 2010 Express). How to solve?

1 Answers1