
I've got 10,000+ text files on disk. My task is to find the most common 3-word sequences that appear in those files. I read each file word by word and increment a global std::map<std::string, int> for each sequence in each file. Finally, I sort the map and pick the most common entries. I managed to write code for it, but I read that I can increase the speed by reading each file in a separate thread.
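The counting step described above can be sketched like this. A minimal single-threaded version, assuming whitespace-separated words; the function name and signature are illustrative, not the question's actual code:

```cpp
#include <fstream>
#include <map>
#include <string>

// Hypothetical helper: count every 3-word sequence in one file,
// incrementing the shared map the question describes.
void CountTrigrams(const std::string& fileName, std::map<std::string, int>& counts)
{
    std::ifstream in(fileName);
    std::string w1, w2, w3;
    if (!(in >> w1 >> w2 >> w3))
        return;                               // fewer than three words in the file
    ++counts[w1 + ' ' + w2 + ' ' + w3];

    std::string next;
    while (in >> next)
    {
        w1 = w2; w2 = w3; w3 = next;          // slide the three-word window
        ++counts[w1 + ' ' + w2 + ' ' + w3];
    }
}
```

A file of N words yields N − 2 sequences, so the map update dominates the per-file work once the file is in memory.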

My first thought was to run as many threads as there are files, but that slowed my program down from 6 s to 80 s.

My second thought was to make some kind of thread pool that runs, for example, 20 threads, waits for all of them to finish (.join()), and then starts the next 20 threads, over and over. That improved my run time from 6 s to 4 s.

But that's still not the fastest way: while the main thread waits for all 20 threads to finish, the slots of the threads that finished 1st through 19th sit idle.

My question is: how do I implement a thread pool where a thread starts working on the next file as soon as it finishes the previous one?

Code for my second thought:

std::vector<std::thread> threads;

const int byThreadPool = 20;    // batch size (was declared char, which compiles but is misleading)
const int nFileCount = 10495;

for (int i = 0; i < nFileCount; i += byThreadPool)
{
    // launch one batch of up to byThreadPool threads
    for (int j = i; j < i + byThreadPool && j < nFileCount; j++)
    {
        std::string fileName = path + std::to_string(j) + PAGE_EXTENSION;
        threads.push_back(std::thread(&CWordParserFileSystem::FetchFile, this, fileName));
    }

    // wait for the whole batch before starting the next one
    for (auto& t : threads)
        t.join();

    threads.clear();
}
  • Make each thread fetch and process tasks from a queue until it's empty. You don't have to terminate and restart threads. If for some reason you do want that, you can use a condition variable where a thread signals that it's about to finish. Then the main program can join that specific thread and launch a new one. – n. m. could be an AI Jul 02 '16 at 18:47
  • You may build a small pipeline: An input queue of files -> { N threads grabbing a file (until there is none), creating a map of words for that file only and put that map to an output queue} -> merge all maps of words in the output queue. –  Jul 02 '16 at 18:48
  • Thread (1): push jobs onto a queue (in this case filenames). Thread (2 to N) pull a job of the queue and process it, repeat until the queue is empty. Increase N and measure the performance. Threads (2 to N) never exit until thread(1) indicates it is finished AND the work queue is empty. Then you have questions like is it faster with each thread having its own std::map (and merging when complete) or competing for a master map? – Richard Critten Jul 02 '16 at 18:49
  • ... on the other hand, you will not get any significant improvement: the bottleneck is the hard drive (I/O) –  Jul 02 '16 at 18:57
  • how about using boost thread pool. http://stackoverflow.com/questions/19500404/how-to-create-a-thread-pool-using-boost-in-c – Rahul Menon Jul 02 '16 at 19:11
  • Correct me if I am wrong, but if you are using 4 core processor with hyperthreading then you can do up to 8 true parallel threads? Adding more threads is good if you want things to seem to run in parallel, not necessarily to improve speed. A special case would be if you use threads to trigger requests which require long response time, such as talking to ports. – Makketronix Jul 03 '16 at 04:41
  • Just tested it and @Makketronix, you seem to be right. With 1 thread I get `8s` result, 2 threads about `4.5s`, 8 threads `3.7s` and for 8+ it stays `3.5-3.7s`. Thanks, it works. – Damian Kalinowski Jul 03 '16 at 09:22
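The queue-based pool the comments describe can be sketched as follows. A minimal version under stated assumptions: files are numbered 0..nFileCount-1 as in the question, the shared atomic counter stands in for a job queue (sufficient here because the jobs are just indices), and processFile is a hypothetical stand-in for CWordParserFileSystem::FetchFile:

```cpp
#include <atomic>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Sketch of a simple work-stealing-by-counter pool: a fixed set of
// long-lived workers claim file indices from a shared atomic counter,
// so each worker moves on to the next file the moment it finishes one.
void ProcessAllFiles(const std::string& path, const std::string& extension,
                     int nFileCount, unsigned nThreads,
                     const std::function<void(const std::string&)>& processFile)
{
    std::atomic<int> nextIndex{0};

    auto worker = [&]() {
        for (;;)
        {
            int i = nextIndex.fetch_add(1);    // claim the next unprocessed file
            if (i >= nFileCount)
                return;                        // no work left; worker exits
            processFile(path + std::to_string(i) + extension);
        }
    };

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back(worker);          // threads are started once...
    for (auto& w : workers)
        w.join();                              // ...and joined once at the end
}
```

To avoid contention on the global map, each worker could accumulate into its own local std::map and merge the maps after join(), as Richard Critten's comment suggests; whether that beats a single mutex-protected map is worth measuring, since (per the I/O comment) the disk may dominate anyway.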
