
I am trying to count the words in a huge file. I want to use as much of the CPU as possible, so I try to split the input data and count words in threads. But I have a problem: when I split the data, the split can fall in the middle of a word, and in the end I get a wrong answer. How can I split the data from the file without splitting words? Can somebody help me?

#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <thread>
#include <mutex>
#include <sstream>
#include <vector>
#include <algorithm>
#define BUFER_SIZE  1024

using namespace std;
std::mutex mtx;

void worker(int n, set<std::string> &mySet, std::string path)
{
    mtx.lock();
    ifstream file (path, ios::in);
    if (file.is_open())
    {
        char *memblock = new char [BUFER_SIZE];
        file.seekg (n * (BUFER_SIZE - 1), ios::beg);
        file.read(memblock, BUFER_SIZE - 1);
        memblock[file.gcount()] = '\0'; // terminate at the bytes actually read
        std::string blockString(memblock);
        std::string buf;
        stringstream stream(blockString);
        while(stream >> buf) mySet.insert(buf);
        file.close();
        delete[] memblock;
    }
    else 
        cout << "Unable to open file";
    mtx.unlock();
}

int main(int argc, char *argv[])
{
    set<std::string> uniqWords;
    int threadCount = 0;
    ifstream file(argv[1], ios::in);
    if(!file){
        std::cout << "Bad path.\n";
        return 1;
    }
    file.seekg(0, ios::end);
    int fileSize = file.tellg();
    file.close();
    std::cout << "Size of the file is" << " " << fileSize << " " << "bytes\n";
    threadCount = fileSize/BUFER_SIZE + 1;
    std::cout << "Thread count: " << threadCount << std::endl;
    std::vector<std::thread> vec;
    for(int i=0; i < threadCount; i++)
    {
        vec.push_back(std::thread(worker, i, std::ref(uniqWords), argv[1]));
    }
    std::for_each(vec.begin(), vec.end(), [](std::thread& th)
    {
        th.join();
    });
    std::cout << "Count: " << uniqWords.size() << std::endl;
    return 0;
}
  • Note: [`std::set` and others are not thread-safe](https://stackoverflow.com/q/1362110/1270789). – Ken Y-N Nov 10 '21 at 07:07
  • 1
    @KenY-N, but I do see a mutex lock in the worker thread. – kiner_shah Nov 10 '21 at 07:08
  • I suggest read a chunk of data into a buffer, give that buffer to a worker thread for processing. Repeat this until end of file. To divide into almost equal chunks, you may need to get the file size first and then divide based on number of threads. – kiner_shah Nov 10 '21 at 07:09
  • 2
    @kiner_shah Oops missed that, but that's locking the whole thread so it makes them run in serial. One solution is to use one set per thread and merge afterwards. – Ken Y-N Nov 10 '21 at 07:13
  • 2
    1. You cannot read data in parallel. You access one resource. 2. The slowest part of processing is reading, so running more than one thread makes no sense. 3. If this is just an exercise, the read part would be a producer thread. The counting part would be a number of consumers. – zdf Nov 10 '21 at 07:18
  • 1
    Reading a file will be IO bound, no point in using threads. – Richard Critten Nov 10 '21 at 07:25
  • @RichardCritten And all his threads are trying to modify the same single-threaded collection, so again no point in using threads. – David Schwartz Nov 10 '21 at 07:27
  • If you lock the whole thread like that, I don't think you used the CPU efficiently. Executing multiple threads one by one is just like one single thread program, right? – Yves Nov 10 '21 at 07:38
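The per-thread-set suggestion from the comments could be sketched roughly like this (the `countUnique` driver and its chunk-vector interface are hypothetical, not from the question; `std::set::merge` requires C++17):

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Each worker fills its own private set, so no mutex is needed while
// scanning; the partial sets are merged only after all threads join.
std::size_t countUnique(const std::vector<std::string>& chunks)
{
    std::vector<std::set<std::string>> partial(chunks.size());
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        threads.emplace_back([&partial, &chunks, i] {
            std::istringstream stream(chunks[i]);
            std::string w;
            while (stream >> w)
                partial[i].insert(w);   // private set: no locking needed
        });
    for (auto& t : threads) t.join();

    std::set<std::string> uniq;
    for (auto& s : partial)
        uniq.merge(s);                  // C++17 std::set::merge
    return uniq.size();
}
```

This keeps the threads genuinely parallel, unlike locking the mutex for the whole body of `worker`.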

1 Answer


The problem right now is that you're reading in a fixed-size chunk and processing it. If that chunk stops mid-word, you're going to count the word twice, once in each buffer it was placed into.

One obvious solution is to only break between buffers at word boundaries--e.g., when you read in one chunk, look backwards from the end to find a word boundary, then have the thread only process up to that boundary.
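A minimal sketch of that backward scan (the helper name `findSplit` is illustrative, not from the question):

```cpp
#include <cctype>
#include <cstddef>

// Return the number of bytes of buf (length len) that end on a word
// boundary: scan backwards from the end until whitespace is found.
// The prefix [0, result) is safe to tokenize; the tail belongs to the
// next chunk.
std::size_t findSplit(const char* buf, std::size_t len)
{
    std::size_t i = len;
    while (i > 0 && !std::isspace(static_cast<unsigned char>(buf[i - 1])))
        --i;
    return i; // 0 means the whole chunk is one partial word
}
```

The caller would then hand the thread only the first `findSplit(...)` bytes, and prepend the leftover tail to the next chunk.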

Another solution that's a bit less obvious (but may have the potential for better performance) is to look at the last character in a chunk to see if it's a word character (e.g., a letter) or a boundary character (e.g., a space). Then when you create and process the next chunk, you tell it whether the previous buffer ended with a boundary or within a word. If it ended within a word, the counting thread knows to ignore the first partial word in the buffer.
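That second approach might look like the sketch below, shown here for a plain word count (the name `countChunk` and its interface are hypothetical):

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Count whitespace-separated words in chunk. If prevEndedMidWord is true,
// the first token continues a word already counted by the previous chunk
// and is skipped. Sets endsMidWord so the caller can pass it to the next
// chunk's processing.
std::size_t countChunk(const std::string& chunk, bool prevEndedMidWord,
                       bool& endsMidWord)
{
    std::istringstream stream(chunk);
    std::string tok;
    std::size_t count = 0;
    bool first = true;
    while (stream >> tok)
    {
        bool continuation = first && prevEndedMidWord && !chunk.empty() &&
            !std::isspace(static_cast<unsigned char>(chunk.front()));
        if (!continuation)
            ++count;
        first = false;
    }
    endsMidWord = !chunk.empty() &&
        !std::isspace(static_cast<unsigned char>(chunk.back()));
    return count;
}
```

Note that for a set of *unique* words (as in the question) the skipped partial word means the previous chunk recorded a truncated word, so the first approach is the safer fit there.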

Jerry Coffin