
I have written a program in C that counts the occurrences of each word in a file and sorts them from most frequent to least frequent. However, I need to use pthreads to create multiple threads, with the number of threads given as a command-line argument. The file needs to be split among that many threads. For example, if 4 is entered on the command line, the file would be split into four parts, each part processed by its own thread, and the four partial results then joined back together. My C is not very good and I am lost on how to do this. Can anyone please help? An example would be great.

Here is my code so far:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(int argc, char **argv) {
       struct stat fileStat;
       char *addr;
       int size, res, file, num_threads;

       list_t *words = (list_t *)malloc(sizeof(list_t));

       res = access(argv[1], F_OK);
       if (res != 0) {
          exit(1);
       }

       stat(argv[1], &fileStat);

       // Check if a regular file.
       if (S_ISREG(fileStat.st_mode)) {
          file = open(argv[1], O_RDONLY);
          if (file < 0)
             exit(1);
          // Check the total size of the file
          size = fileStat.st_size;
          num_threads = atoi(argv[2]);
          if ((addr = mmap(0, size, PROT_READ, MAP_SHARED, file, 0)) == MAP_FAILED) {
             exit(1);
          }
          munmap(addr, size);
          close(file);
       } else {
          exit(1);
       }
Harley Jones
    Should this be also tagged as `linux`? If you are splitting a text file into four parts, you'll need to do some pre-processing so as not to split a word. I don't understand what you mean by *"the four parts would need to be joined back together"*. Do you mean the results? Wouldn't it be better to assign each thread a different section of the alphabet (say a-f, g-m, n-s, t-z)? – Weather Vane Dec 05 '14 at 15:41
  • Yes, the results would need to be joined back together. – Harley Jones Dec 06 '14 at 01:28

3 Answers


Multiple threads can safely read the same source file without issue; it's concurrent writing that causes problems.

My suggestion (without really understanding the requirements) is:

  • On launch, determine the file size
  • Calculate size / thread count
  • Let's say the file is 4 KB; you get a value of about 1 KB per thread
  • Seek into the file one chunk size, then read single bytes until you find a word separator
  • That position is the end of thread 1's zone and the start of thread 2's
  • Seek to the second and third chunk boundaries and do the same
  • At this point you have a start and finish position for each thread
  • Launch each thread, passing it the positions it is responsible for covering
  • Make the hash table (or whatever structure you use for counting words) thread-safe with mutual exclusion, and have each thread add to the count of whatever word it finds
  • Once all threads are done, you have your list
RobC
  • Would I need to use the mmap function? I know how to get the total size of the file. I'm not sure how to implement from there. – Harley Jones Dec 06 '14 at 23:54
  • So say I want it to use 4 threads. I would get the chunk_size by dividing the file_size by the number of threads. Then use a for loop to loop through the number of threads, each time multiplying chunk_size by i. Then say lseek(file, chunk_size, SEEK_CUR);. Would this be right so far? Now I would create a new thread after that? – Harley Jones Dec 07 '14 at 04:17
  • mmap is easier than manual file access, you can access it like a buffer in memory. If you break it down into its component parts it would be easier than you think. Step one start off with a simple program that mmaps the file. Then calc the start and stop positions for each thread. Then launch the threads pass them a struct with two numbers, start and stop. Each thread is just a simple program that starts at buf[start] and ends at buf[stop]. Break it down into tiny projects that are easy one by one. Semantics of C will probably be the big issue if you aren't proficient yet. – RobC Dec 07 '14 at 08:34
  • Yeah, my C isn't very good. So I have the mmap in my code above. How would I get the start and stop positions with that? It would need to work for any number of threads. – Harley Jones Dec 07 '14 at 17:27
  • @HarleyJones as I mentioned above, it's all very well chopping a file into four parts and using `lseek()`, but you'll need a strategy for handling words that straddle each boundary. For example if the word "notice" is broken in its middle, you'll need a way of preventing "not" and "ice" being reported, while "notice" is overlooked. – Weather Vane Dec 07 '14 at 18:04

The idea is that by dividing the work among multiple threads and joining the partial results afterwards, the same operation completes much faster. So you need to:

  1. Divide the work into many parts without wasting too much time
  2. Find a way of joining the partial results back together easily
  3. Solve the problems at the boundaries created by dividing the work

The first part is simple. Just divide your data equally between threads.

The second part is easy too. Just sum the results.

The tricky part is number 3. In your case, you could end up with a word divided between two different threads. To avoid counting "half-words", each thread must keep a separate record of the first and last word of its segment. Then, once you have all the results, you can take the last word of thread N, concatenate it with the first word of thread N+1, and only then add that word to the count. Obviously, if a separator (space, newline, ...) is the first/last character a thread sees, its respective first/last word will be empty.

In pseudo-code:

def main:
    size = filesize
    ptr = mmap(file)
    num_threads = 4

    for i in range(0, num_threads):
        new_thread(exec = count_words,
                   start = ptr + i * size / num_threads,
                   length = size / num_threads)

    wait_for_result(all_threads)
    join_the_results

def count_words(start, length):
    # Count words as if it were an entire file,
    # but store separately the first/last word if
    # the segment does not start/end with a word
    # separator (" ", ".", ",", "\n", etc.)
    return (count_of_words, first_word, last_word)

This is the same idea behind MapReduce.

Governa

This is not perfectly polished code. I have used C++; if you must use C, you can use POSIX threads instead of std::thread. Also, I have simply divided the whole file size by the number of threads, so you will have to handle the last chunk of data (the remainder of that division) in the last thread yourself. I haven't done this.

Another point is how I retrieve the return values from the threads: for now I save them to a global array. C++11 supports retrieving return values directly - see C++: Simple return value from std::thread?

#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
#include <cstring>
using namespace std;

#define NO_OF_THREADS 4

int countArray[100];
std::mutex g_pages_mutex;

int trimWhiteSpaces(char *arr, int start, int len)
{
    int i = 0;
    for(; i < len; i++)
    {
        char c = arr[i];
        if(c == ' ')
        {
            continue;
        }
        else
            break;
    }

    return i;
}

void getWordCount(char *arr, int len, int ID)
{

    int count = 0;
    bool isSpace = false;
    int i = 0;
    i = i + trimWhiteSpaces(arr, i, len);
    for(; i < len; i++)
    {
        char c = arr[i];
        if(c == ' ')
        {
            i = i + trimWhiteSpaces(&arr[i], i, len - i) - 1;  // len - i: don't scan past the chunk
            //printf("Found space");
            isSpace = true;
            count++;
        }
        else
        {
            isSpace = false;
        }
    }

    if(isSpace)
        count = count - 1;
    count = count + 1;
    g_pages_mutex.lock();
    cout << "MYCOUNT:" << count << "\n";
    countArray[ID] = count;
    g_pages_mutex.unlock();
}

int main(int argc, const char * argv[])
{
    // One buffer per thread: a single shared buffer would be overwritten
    // by the next read while an earlier thread is still scanning it.
    static char fileData[NO_OF_THREADS][5000];
    std::thread threadIDs[NO_OF_THREADS];
    const char *filePath = "/Users/abc/Desktop/test.txt";
    int read_sz = 0;
    int decrements = 0;
    bool previousNotEndsInSpace = false;

    std::ifstream is(filePath, std::ifstream::ate | std::ifstream::binary);
    int fileSize = is.tellg();
    int bulkSize = fileSize / NO_OF_THREADS;
    is.seekg(0);

    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        int old_read_sz = read_sz;
        is.read(fileData[iter], bulkSize);
        read_sz = is.tellg();
        int chunk_sz = read_sz - old_read_sz;
        if(chunk_sz > 0)
        {
            fileData[iter][chunk_sz] = '\0';
            cout << " data size so far: " << read_sz << "\n";
            cout << fileData[iter] << endl;
            if(previousNotEndsInSpace && fileData[iter][0] != ' ')
            {
                decrements = decrements + 1;
            }

            // Index relative to this chunk, not to the whole file
            previousNotEndsInSpace = (fileData[iter][chunk_sz - 1] != ' ');
            threadIDs[iter] = std::thread(getWordCount, fileData[iter], chunk_sz, iter);
        }
    }

    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        if(threadIDs[iter].joinable())   // skip threads never started (file shorter than expected)
            threadIDs[iter].join();
    }

    int totalCount = 0;
    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        cout << "COUNT: " << countArray[iter] << "\n";
        totalCount = totalCount + countArray[iter];
    }

    cout << "TOTAL: " << totalCount - decrements << "\n";
    return 0;
}
Seema Kadavan