Concurrent Threads Slower than Single Thread

Question

I've been comparing two ways of finding the highest value from a matrix (should they be duplicated, randomly choose between them), single threaded vs multi-threaded. Typically, the multi-thread should be faster, assuming I coded this properly. Because it is not, it is by far slower, I can only assume I did something wrong. Can someone pinpoint what I did wrong?

Note: I know I shouldn't use rand(), but for this purpose I feel there aren't that many problems with doing so, I'll replace it with a mt19937_64 after this is working properly.

Thanks in advance!

double* RLPolicy::GetActionWithMaxQ(std::tuple<int, double*, int*, int*, int, double*>* state, long* selectedActionIndex, bool& isActionQZero)
{
    const bool useMultithreading = true;

    double* qIterator = Utilities::DiscretizeStateActionPairToStart(_world->GetCurrentStatePointer(), (long*)&(std::get<0>(*state)));

    // Represents the action-pointer for which Q-values are duplicated
    // Note: A shared_ptr is used instead of a unique_ptr since C++11 wont support unique_ptrs for pointers to pointers **
    static std::shared_ptr<double*> duplicatedQValues(new double*[*_world->GetActionsNumber()], std::default_delete<double*>());
    /*[](double** obj) {
    delete[] obj;
    });*/

    static double* const defaultAction = _actionsListing.get();// [0];
    double* actionOut = defaultAction; //default action
    static double** const duplicatedQsDefault = duplicatedQValues.get();

    if (!useMultithreading)
    {    
        const double* const qSectionEnd = qIterator + *_world->GetActionsNumber() - 1;

        double* largestValue = qIterator;
        int currentActionIterator = 0;

        long duplicatedIndex = -1;

        do {
            if (*qIterator > *largestValue)
            {
                largestValue = qIterator;
                actionOut = defaultAction + currentActionIterator;
                *selectedActionIndex = currentActionIterator;
                duplicatedIndex = -1;
            }
            // duplicated value, map it
            else if (*qIterator == *largestValue)
            {
                ++duplicatedIndex;
                *(duplicatedQsDefault + duplicatedIndex) = defaultAction + currentActionIterator;
            }
            ++currentActionIterator;
            ++qIterator;
        } while (qIterator != qSectionEnd);

        // If duped (equal) values are found, select among them randomly with equal probability
        if (duplicatedIndex >= 0)
        {
            *selectedActionIndex = (std::rand() % duplicatedIndex);
            actionOut = *(duplicatedQsDefault + *selectedActionIndex);
        }

        isActionQZero = *largestValue == 0;

        return actionOut;

    }
    else
    {
        static const long numberOfSections = 6;
        unsigned int actionsPerSection = *_world->GetActionsNumber() / numberOfSections;
        unsigned long currentSectionStart = 0;

        static double* actionsListing = _actionsListing.get();

        long currentFoundResult = FindActionWithMaxQInMatrixSection(qIterator, 0, actionsPerSection, duplicatedQsDefault, actionsListing);

        static std::vector<std::future<long>> maxActions;
        for (int i(0); i < numberOfSections - 1; ++i)
        {
            currentSectionStart += actionsPerSection;
            maxActions.push_back(std::async(&RLPolicy::FindActionWithMaxQInMatrixSection, std::ref(qIterator), currentSectionStart, std::ref(actionsPerSection), std::ref(duplicatedQsDefault), actionsListing));
        }

        long foundActionIndex;

        actionOut = actionsListing + currentFoundResult;

        for (auto &f : maxActions)
        {
            f.wait();

            foundActionIndex = f.get();

            if (actionOut == nullptr)
                actionOut = defaultAction;
            else if (*(actionsListing + foundActionIndex) > *actionOut)
                actionOut = actionsListing + foundActionIndex;
        }

        maxActions.clear();

        return actionOut;
    }
}

/*
    Deploy a thread to find the action with the highest Q-value for the provided Q-Matrix section.

    @return - The index of the action (on _actionListing) which contains the highest Q-value.
*/
long RLPolicy::FindActionWithMaxQInMatrixSection(double* qMatrix, long sectionStart, long sectionLength, double** dupListing, double* actionListing)
{
    double* const matrixSectionStart = qMatrix + sectionStart;
    double* const matrixSectionEnd = matrixSectionStart + sectionLength;
    double** duplicatedSectionStart = dupListing + sectionLength;

    static double* const defaultAction = actionListing;
    long maxValue = sectionLength;
    long maxActionIndex = 0;
    double* qIterator = matrixSectionStart;
    double* largestValue = matrixSectionStart;

    long currentActionIterator = 0;

    long duplicatedIndex = -1;

    do {
        if (*qIterator > *largestValue)
        {
            largestValue = qIterator;
            maxActionIndex = currentActionIterator;
            duplicatedIndex = -1;
        }
        // duplicated value, map it
        else if (*qIterator == *largestValue)
        {
            ++duplicatedIndex;
            *(duplicatedSectionStart + duplicatedIndex) = defaultAction + currentActionIterator;
        }
        ++currentActionIterator;
        ++qIterator;
    } while (qIterator != matrixSectionEnd);

    // If duped (equal) values are found, select among them randomly with equal probability
    if (duplicatedIndex >= 0)
    {
        maxActionIndex = (std::rand() % duplicatedIndex);
    }

    return maxActionIndex;
}

@AshokBhaskar - This test PC only has 2 cores and 4 threads (Intel® Core™ i5-6300U Processor), my main pc has 4 cores. I was considering running this on AWS to be faster, so I suppose it will have more cores than this laptop. — JDoe, Apr 18 '18 at 17:03
@SomeWittyUsername - I am no coder by training, so... do you mind explaining whats wrong? Anyway not exactly focusing on the question. — JDoe, Apr 18 '18 at 17:07
Unless you are actually assigning the tasks to different cores, you are just running more code on the same code, which will not be faster.. it will be slower. — David Hoelzer, Apr 18 '18 at 17:08
@DavidHoelzer - Is there a programatic mode to find how many available cores there are? — JDoe, Apr 18 '18 at 17:09
@DavidHowlzer - Just found it! https://stackoverflow.com/questions/150355/programmatically-find-the-number-of-cores-on-a-machine still you see nothing wrong with the code? E.g. threads trying to run on the same variable. Again, I am quite newbie on these subjects.. — JDoe, Apr 18 '18 at 17:11
@JDoe How much slower is the multithreaded version? I'm not sure if your specific implementation is buggy, but parallel code can be slower than single-threaded code if the overhead of parallel code is too large, e.g. data copying time or threads competing for the memory bus. — Ashok Bhaskar, Apr 18 '18 at 17:12
@SomeWittyUsername int i(0) is fine, but I think uniform initialization syntax (i.e. with braces, int i{0}) is the way to go; take a look at https://softwareengineering.stackexchange.com/questions/133688/is-c11-uniform-initialization-a-replacement-for-the-old-style-syntax?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa — Ashok Bhaskar, Apr 18 '18 at 17:22
Are you sure the code is actually running in parallel? `std::async` defaults to a launch policy that doesn't enforce using threads, and may only be doing lazy evaluation of the function. You can use the `std::async(std::launch::async, ...)` overload to ensure multiple threads are used. — bnaecker, Apr 18 '18 at 17:22
Single thread is faster than multi thread because each thread has its own context and OS switch from one context to another and this means that OS must store the status of each context every time that a thread is waiting and recover when it is reactivated. Also consider the size of your data and the algorithm that you are using, maybe Multithreading could be faster than single thread when it has an huge amount of data, check this link about Asymptotic Complexity: https://www.cs.cornell.edu/courses/cs3110/2013sp/supplemental/lectures/lec19-asymp/review.html — Jorge Omar Medra, Apr 18 '18 at 17:26
@AshokBhaskar - In microseconds, single-thread takes 2 on average, multi-thread takes 500-ish. — JDoe, Apr 18 '18 at 17:38
@bnacker - enforced your version thank you! Still, same result. — JDoe, Apr 18 '18 at 17:38
2 microseconds vs 500 microseconds sounds like the cost of parallel overhead to me - can you generate a larger test case and see how the two algorithms compare? — Ashok Bhaskar, Apr 18 '18 at 18:07

score 1 · Answer 1 · answered Apr 18 '18 at 19:53

Parallel programs are not necessarily faster than serial programs; there are both fixed and variable time costs to setting up parallel algorithms, and for small and/or simple problems, this parallel overhead cost can be greater than the cost of the serial algorithm as a whole. Examples of parallel overhead include thread spawning and synchronization, extra memory copying, and memory bus pressure. At around 2 mircoseconds for your serial program and around 500 microseconds for your parallel program, it's likely your matrix is small enough that the work of setting up the parallel algorithm overshadows the work of solving the matrix problem.

Concurrent Threads Slower than Single Thread

1 Answers1