
I'm getting started with multithreaded programming, so please excuse me if the following seems obvious. I'm adding multithreading to an image processing program, and the speedup isn't quite what I expected.

I'm currently getting a 4x speedup on a CPU with 4 physical cores and hyperthreading (8 logical cores), so I'd like to know whether this kind of speedup is expected. The only explanation I can think of is that both hyperthreads of a single physical core have to share some sort of memory bus.

Being new to multithreading, it's not entirely clear to me whether this would be considered an I/O-bound program, given that all memory is allocated in RAM (I understand that my OS's virtual memory manager is what decides to page this heap memory in and out). My machine has 16 GB of RAM, in case that helps decide whether paging/swapping can be an issue.

I've written a test program showcasing the serial case and two parallel cases, using QThreadPool and tbb::parallel_for.

As you can see, the current program has no real per-pixel work other than setting a supposed image from black to white; this is deliberate, to establish a baseline before any real operations are applied to the image.

I'm attaching the program in the hope that someone can tell me whether my quest for a roughly 8x speedup is a lost cause with this kind of processing algorithm. Note that I'm not interested in other kinds of optimizations such as SIMD: my concern is not just to make it faster, but to make it faster using multithreading alone, without getting into SSE or processor-cache-level optimizations.

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <sys/time.h>

#include <vector>
#include <QRunnable>
#include <QThreadPool>
#include <tbb/tbb.h>

#define LOG(x) (std::cout << x << std::endl)

struct col4
{
    unsigned char r, g, b, a;
};

class QTileTask : public QRunnable
{
public:
    void run()
    {
        for(uint32_t y = m_yStart; y < m_yEnd; y++)
        {
            int rowStart = y * m_width;
            for(uint32_t x = m_xStart; x < m_xEnd; x++)
            {
                int index = rowStart + x;
                m_pData[index].r = 255;
                m_pData[index].g = 255;
                m_pData[index].b = 255;
                m_pData[index].a = 255;
            }
        }
    }

    col4*          m_pData;
    uint32_t       m_xStart;
    uint32_t       m_yStart;
    uint32_t       m_xEnd;
    uint32_t       m_yEnd;
    uint32_t       m_width;
};

struct TBBTileTask
{
    void operator()()
    {
        for(uint32_t y = m_yStart; y < m_yEnd; y++)
        {
            int rowStart = y * m_width;
            for(uint32_t x = m_xStart; x < m_xEnd; x++)
            {
                int index = rowStart + x;
                m_pData[index].r = 255;
                m_pData[index].g = 255;
                m_pData[index].b = 255;
                m_pData[index].a = 255;
            }
        }
    }

    col4*          m_pData;
    uint32_t       m_xStart;
    uint32_t       m_yStart;
    uint32_t       m_xEnd;
    uint32_t       m_yEnd;
    uint32_t       m_width;
};

struct TBBCaller
{
    TBBCaller(std::vector<TBBTileTask>& t)
        : m_tasks(t)
    {}

    TBBCaller(TBBCaller& e, tbb::split)
        : m_tasks(e.m_tasks)
    {}

    void operator()(const tbb::blocked_range<size_t>& r) const
    {
        for (size_t i=r.begin();i!=r.end();++i)
            m_tasks[i]();
    }

    std::vector<TBBTileTask>& m_tasks;
};

inline double getcurrenttime( void )
{
    timeval t;
    gettimeofday(&t, NULL);
    return static_cast<double>(t.tv_sec)+(static_cast<double>(t.tv_usec) / 1000000.0);
}

char* getCmdOption(char ** begin, char ** end, const std::string & option)
{
    char ** itr = std::find(begin, end, option);
    if (itr != end && ++itr != end)
    {
        return *itr;
    }
    return 0;
}

bool cmdOptionExists(char** begin, char** end, const std::string& option)
{
    return std::find(begin, end, option) != end;
}

void baselineSerial(col4* pData, int resolution)
{
    double t = getcurrenttime();
    for(int y = 0; y < resolution; y++)
    {
        int rowStart = y * resolution;
        for(int x = 0; x < resolution; x++)
        {
            int index = rowStart + x;
            pData[index].r = 255;
            pData[index].g = 255;
            pData[index].b = 255;
            pData[index].a = 255;
        }
    }
    LOG((getcurrenttime() - t) * 1000 << " ms. (Serial)");
}

void baselineParallelQt(col4* pData, int resolution, uint32_t tileSize)
{
    double t = getcurrenttime();

    QThreadPool pool;
    for(int y = 0; y < resolution; y+=tileSize)
    {
        for(int x = 0; x < resolution; x+=tileSize)
        {
            uint32_t xEnd = std::min<uint32_t>(x+tileSize, resolution);
            uint32_t yEnd = std::min<uint32_t>(y+tileSize, resolution);

            QTileTask* t = new QTileTask;
            t->m_pData = pData;
            t->m_xStart = x;
            t->m_yStart = y;
            t->m_xEnd = xEnd;
            t->m_yEnd = yEnd;
            t->m_width = resolution;
            pool.start(t);
        }
    }
    pool.waitForDone();
    LOG((getcurrenttime() - t) * 1000 << " ms. (QThreadPool)");
}

void baselineParallelTBB(col4* pData, int resolution, uint32_t tileSize)
{
    double t = getcurrenttime();

    std::vector<TBBTileTask> tasks;
    for(int y = 0; y < resolution; y+=tileSize)
    {
        for(int x = 0; x < resolution; x+=tileSize)
        {
            uint32_t xEnd = std::min<uint32_t>(x+tileSize, resolution);
            uint32_t yEnd = std::min<uint32_t>(y+tileSize, resolution);

            TBBTileTask t;
            t.m_pData = pData;
            t.m_xStart = x;
            t.m_yStart = y;
            t.m_xEnd = xEnd;
            t.m_yEnd = yEnd;
            t.m_width = resolution;
            tasks.push_back(t);
        }
    }

    TBBCaller caller(tasks);
    tbb::task_scheduler_init init;
    tbb::parallel_for(tbb::blocked_range<size_t>(0, tasks.size()), caller);

    LOG((getcurrenttime() - t) * 1000 << " ms. (TBB)");
}

int main(int argc, char** argv)
{
    int resolution = 1;
    uint32_t tileSize = 64;

    char * pResText = getCmdOption(argv, argv + argc, "-r");
    if (pResText)
    {
        resolution = atoi(pResText);
    }

    char * pTileSizeChr = getCmdOption(argv, argv + argc, "-b");
    if (pTileSizeChr)
    {
        tileSize = atoi(pTileSizeChr);
    }

    if(resolution > 16)
        resolution = 16;

    resolution = resolution << 10;

    uint32_t tileCount = (resolution + tileSize - 1) / tileSize; // tiles per side, rounded up
    tileCount *= tileCount;

    LOG("Resolution: " << resolution << " Tile Size: "<< tileSize);
    LOG("Tile Count: " << tileCount);

    uint64_t pixelCount = resolution*resolution;
    col4* pData = new col4[pixelCount];

    memset(pData, 0, sizeof(col4)*pixelCount);
    baselineSerial(pData, resolution);

    memset(pData, 0, sizeof(col4)*pixelCount);
    baselineParallelQt(pData, resolution, tileSize);

    memset(pData, 0, sizeof(col4)*pixelCount);
    baselineParallelTBB(pData, resolution, tileSize);

    delete[] pData;

    return 0;
}
  • Hyperthreading only helps when your code has lots of branch mispredicts and/or lots of code waiting on memory stalls. (It allows thread B to execute on the same core while thread A is waiting for data or is not using parts of the execution units.) Your code looks like it's bit-blasting, which will quickly become memory-bandwidth limited. – Billy ONeal Aug 06 '15 at 20:32
  • 4x speedup on a quad-core processor is impressive already, regardless of HT. I can't count how many SO questions I see about people getting no speedup or a negative speedup. – Mysticial Aug 06 '15 at 22:05
  • Hey thanks guys, @BillyONeal I didn't fully understand how hyperthreading works until I saw your reply; looks like I'll have to catch up and read some books about the HT architecture and how to get the best out of it! ;) – Exaberri Tokugawa Aug 10 '15 at 13:01

1 Answer

Yes, a 4x speedup is expected. Hyperthreading is a kind of time-sharing implemented in hardware, so you can't expect to benefit from it when one thread is already using all the superscalar pipelines available on the core, as is the case here. The other thread will necessarily have to wait.

You can expect an even lower speedup if the memory bus bandwidth is saturated by fewer threads than there are cores available. That usually happens when there are many cores, as in this question:

Why doesn't this code scale linearly?

lvella
  • @lvella Thanks for your reply, and especially for your second paragraph! I think I'm starting to understand a bit better how things work. Regarding what you said about memory bus saturation, you're completely right: I've added some more code on top of the simple example I posted here, and noticed that in some instances I get speedups by *not* precomputing data and instead redoing it per thread, since that frees the bus for other queries. Not precomputing data is quite unusual in single-threaded programming, but in this case it makes a lot of sense. Thanks! – Exaberri Tokugawa Aug 10 '15 at 12:58