How to apply Multi-threading/processing to run a function with input broken to chunks?

Question

How do I run a function to break the text input as chunks and run it simultaneously?

INPUT
s = "ABCDEFGHIJKABCLMNOPQRSTUVABCSDLSFIJKKJJKLOP"

getSubstringCount(s, 3)

OUTPUT
{'ABC': 3, 'BCD': 1, 'CDE': 1, 'DEF': 1, ...and so on}

How do I run this function in a multithreaded way? Such that the string is broken down to chunks and passed onto the getSubstringCount?

Something like:

['ABCDEFGHIJ', 'KABCLMNOPQ', 'RSTUVABCSD', 'LSFIJKKJJKLOP']

And then they get passed to the function in a loop somehow? Any ideas on how to proceed with this?

what is the logic in which you want to chunk them to sub arrays? seems your problem is to count the number of occurrences of each substring of given length. see this https://stackoverflow.com/questions/15726641/find-all-possible-substring-in-fastest-way — Anand Sowmithiran, Jan 28 '22 at 12:44
Does this answer your question? [Find all possible substring in fastest way](https://stackoverflow.com/questions/15726641/find-all-possible-substring-in-fastest-way) — Anand Sowmithiran, Jan 28 '22 at 12:45
std::thread was added in c++11. You can likely also look into operating system specific means if necessary. — Abel, Jan 28 '22 at 12:52
@AnandSowmithiran the logic is shown above as I want to break the input string to be broken into smaller strings and passed through the function. But how do I run this in a multithreaded fashion? — Parzival, Jan 28 '22 at 12:58
when you break them into smaller strings, you will get incorrect results anyways. For e.g. if one substring ended with `...AB`, and another starts as `'CH..'`, then your count for number of times `ABC` present will be incorrect. — Anand Sowmithiran, Jan 28 '22 at 13:00
If you break it down like you've shown, how would you be able to count occurrences of substrings like `IJK` ? — Ted Lyngmo, Jan 28 '22 at 13:01
To observe a potential gain of performance, the string must be huge, a few megabytes. Otherwise, the time to start the threads may be (orders of magnitude) slower than the processing. — prapin, Jan 28 '22 at 13:44
@prapin yes true, I want to use this for very large files only ( a few GBs ). Is there any way? — Parzival, Jan 28 '22 at 14:25
@AnandSowmithiran and Ted, it is true some may not work but I'm choosing to ignore those cases...do you see a solution without counting that? — Parzival, Jan 28 '22 at 14:27
For gigabytes of data, it definitely makes sense to paralyze. I will not write the code for you, but here are some clues. Mount the file in memory using `mmap` or `CreateFileMapping` on Windows. Divide the length of the file by the number of CPU cores N. Start N `std::threads`, passing as argument to each the start and end offsets to process. Each thread will read its part of memory mapped string and make its own histogram. At the end, `main` merges the histograms (summing). That way, it will be performant: no mutex, no false sharing. — prapin, Jan 28 '22 at 14:57
Which of the steps are you having trouble with? You have the function already, right? So you just need to run it in a `std::thread`. Have you done that? And then create multiple `std::thread` objects? And do you know how to split up your input? You can't be stuck on all the steps if you haven't attempted them yet. — Useless, Jan 28 '22 at 15:48

score 0 · Answer 1 · answered Jan 28 '22 at 15:16

If u want print or store something like this {'ABC': 3, 'BCD': 1, 'CDE': 1, 'DEF': 1, ...and so on} u need O(n * k) memory and time. Even if u can split it into chunks and run parallel and handle corner cases around splittings, you need merge that dictionaries and i dont' know how to parallel this so it take at least O(n) and this cant be O(n / threads) anyway (my opinion). Best i can offer to u it's this:

#include <iostream>
#include <fstream>
#include <unordered_set>
#include <unordered_map>
#include <cassert>
#include <vector>
#include <thread>

const long long MOD = 1e9 + 7; // prime number
const long long P = 269;       // polynomial hash base

int calculateCustomHash(std::string::const_iterator beg, int k) // O(k)
{
    long long res = 0;
    for (int i = 0; i < k; ++i, ++beg)
    {
        res *= P;
        res += *beg;
        res %= MOD;
    }
    res += (res < 0) * MOD;
    return res;
}

void format(std::string::const_iterator beg, int k, int value, std::string &to_add)
{
    std::string res;
    to_add.reserve(to_add.size() + k + 12);
    to_add.push_back('\'');
    std::copy(beg, beg + k, std::back_inserter(to_add));
    to_add += "': ";
    to_add += std::to_string(value);
    to_add += ", ";
}

struct SubstringCount // mem : O(s.size()) time : O(s.size()), does not depend on k
{
    std::string _s;
    int _k;
    std::unordered_map<int, int> _res{};
    std::vector<int> hashes{};
    SubstringCount(std::string::const_iterator beg = {}, std::string::const_iterator end = {}, int k = 0) : _k(k) // O(end - beg)
    {
        if (k == 0)
            return;
        _s.reserve(end - beg);
        std::copy(beg, end, std::back_inserter(_s));
        if (end - beg < k)
        {
            return;
        }
        hashes.reserve(end - beg - k + 1);
        long long p_pow_k = 1;
        long long current_hash = 0;
        std::string::const_iterator it = beg;
        for (int i = 0; i < k; ++i, ++it)
        {
            current_hash *= P;
            current_hash += *it;
            current_hash %= MOD;
            p_pow_k *= P;
            p_pow_k %= MOD;
        }
        current_hash += (current_hash < 0) * MOD;
        hashes.push_back(current_hash);
        ++_res[current_hash];
        for (; it != end; ++it, ++beg)
        {
            current_hash *= P;
            current_hash += *it;
            current_hash -= p_pow_k * *beg;
            current_hash %= MOD;
            current_hash += (current_hash < 0) * MOD;
            hashes.push_back(current_hash);
            ++_res[current_hash];
        }
    }

    std::string decode() // O((end - beg) * k)
    {
        std::string::const_iterator beg = _s.begin();
        std::string::const_iterator end = _s.end();
        if (end - beg < _k)
        {
            return {};
        }
        std::string res = "{";
        std::unordered_set<int> printed;
        long long p_pow_k = 1;
        long long current_hash = 0;
        std::string::const_iterator it = beg;
        for (int i = 0; i < _k; ++i, ++it)
        {
            current_hash *= P;
            current_hash += *it;
            current_hash %= MOD;
            p_pow_k *= P;
            p_pow_k %= MOD;
        }
        current_hash += (current_hash < 0) * MOD;
        if (printed.insert(current_hash).second)
        {
            format(beg, _k, _res.at(current_hash), res);
        }
        for (; it != end; ++it, ++beg)
        {
            current_hash *= P;
            current_hash += *it;
            current_hash -= p_pow_k * *beg;
            current_hash %= MOD;
            current_hash += (current_hash < 0) * MOD;
            if (printed.insert(current_hash).second)
            {
                format(beg + 1, _k, _res.at(current_hash), res);
            }
        }
        res += "}";
        return res;
    }

    int getCount(const std::string &s) // O(k)
    {
        assert(s.size() == _k);
        int hash = calculateCustomHash(s.begin(), _k);
        if (_res.count(hash))
        {
            return _res.at(hash);
        }
        return 0;
    }

    int getCount(int ind)
    {
        if (ind < 0)
            return 0;
        if (ind >= hashes.size())
            return 0;
        return _res.at(hashes[ind]);
    }
};

int main()
{
    int k = 3;
    std::string s = "ABCDEFGHIJKABCLMNOPQRSTUVABCSDLSFIJKKJJKLOP";
    SubstringCount subscnt(s.begin(), s.end(), k); // mem : O(s.size()) time : O(s.size()), does not depend on k
    std::cout << subscnt.decode() << "\n";         // O(s.size() * k)
    // accesing to any substring count O(k) each
    std::string ijk;
    format(std::string("IJK").begin(),
           k,
           subscnt.getCount("IJK"), // O(k)
           ijk);
    std::cout << ijk << "\n";
    ijk.clear();
    format(s.begin() + 8,
           k,
           subscnt.getCount(8), // O(1)
           ijk);
    std::cout << ijk << "\n";

    // multi-threading ???
    const int threads_num = 8;
    int piece_size = s.size() / threads_num + 1;
    SubstringCount subscnts[threads_num];
    std::thread threads[threads_num];
    for (int i = 0; i < threads_num; ++i)
    {
        threads[i] = std::thread([piece_size, i, &s, &subscnts, k]()
                                 {
        std::string piece = s.substr(piece_size * i, piece_size);
        subscnts[i] = SubstringCount(piece.begin(), piece.end(), k); }); // O(piece_size)
    }
    for (int i = 0; i < threads_num; ++i){
        threads[i].join();
    }
    std::cout << "\n";
    // u cant write from diffrent threads (or u can get mess of symbols overlaping eachother) so u must join them before and then print:
    for (int i = 0 ; i < threads_num; ++i){
        std::cout << subscnts[i].decode() << "\n"; // O(piece_size * k)
    }
}

not multi-threaded but works O(n) time and O(n) mem (i don't think it can be better even with multi-threading)

-I want to use this for very large files only ( a few GBs ). Is there any way? do you have enough RAM to store this string? if no u have bigger problems than multi-threading

- it is true some may not work but I'm choosing to ignore those cases...do you see a solution without counting that? if u ignore this cases just split string and run in single thread and not merge resulting dict u can do like code above or u can do shared unordered_map protected by mutex and update it from multiple threads by batches about 64 records for one mutex lock

How to apply Multi-threading/processing to run a function with input broken to chunks?

1 Answers1