
I have written a program in C that counts the occurrences of each word in a file and sorts them from most frequent to least frequent. However, I need to use pthreads to create multiple threads, with the number of threads given as a command-line argument. The file needs to be split among that many threads. For example, if 4 is entered on the command line, the file would be split into four parts, each part processed by its own thread, and the four partial results then joined back together. My C is not very good and I am lost on how to do this. Can anyone please help? An example would be great.

Here is my code so far:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(int argc, char **argv) {
       struct stat fileStat;
       char *addr;
       int size, res, file, num_threads;

       list_t *words = (list_t *)malloc(sizeof(list_t));

       res = access(argv[1], F_OK);
       if (res != 0) {
          exit(1);
       }

       stat(argv[1], &fileStat);

       // Check if a regular file.
       if (S_ISREG(fileStat.st_mode)) {
          file = open(argv[1], O_RDONLY);
          if (file < 0)
             exit(1);
          // Check the total size of the file
          size = fileStat.st_size;
          num_threads = atoi(argv[2]);
          if ((addr = mmap(0, size, PROT_READ, MAP_SHARED, file, 0)) == MAP_FAILED) {
             exit(1);
          }
          munmap(addr, size);
          close(file);
       } else {
          exit(1);
       }
Harley Jones
    Should this be also tagged as `linux`? If you are splitting a text file into four parts, you'll need to do some pre-processing so as not to split a word. I don't understand what you mean by *"the four parts would need to be joined back together"*. Do you mean the results? Wouldn't it be better to assign each thread a different section of the alphabet (say a-f, g-m, n-s, t-z)? – Weather Vane Dec 05 '14 at 15:41
  • Yes, the results would need to be joined back together. – Harley Jones Dec 06 '14 at 01:28

3 Answers


Multiple threads can safely read the same source file without issue; it's concurrent writing that causes problems.

My suggestion (without really understanding the requirements) is:

  • On launch, determine the file size
  • Calculate size / thread count
  • Let's say the file is 4 KB; you get a value of about 1 KB per thread
  • Seek into the file one chunk size, then read single bytes until you find a word separator
  • That position is the end of thread 1's zone and the start of thread 2's
  • Seek to the second and third chunk boundaries and do the same
  • At this point you have a start and finish position for each thread
  • Launch each thread, passing it the positions it is responsible for covering
  • Make the hash table (or whatever structure you use for counting words) thread-safe with mutual exclusion, and have each thread add to the count of whatever word it finds
  • Once all threads are done, you have your list
RobC
  • Would I need to use the mmap function? I know how to get the total size of the file. I'm not sure how to implement from there. – Harley Jones Dec 06 '14 at 23:54
  • So say I want it to use 4 threads. I would get the chunk_size by dividing the file_size by the number of threads. Then use a for loop to loop through the number of threads, each time multiplying chunk_size by i. Then say lseek(file, chunk_size, SEEK_CUR);. Would this be right so far? Now I would create a new thread after that? – Harley Jones Dec 07 '14 at 04:17
  • mmap is easier than manual file access, you can access it like a buffer in memory. If you break it down into its component parts it would be easier than you think. Step one start off with a simple program that mmaps the file. Then calc the start and stop positions for each thread. Then launch the threads pass them a struct with two numbers, start and stop. Each thread is just a simple program that starts at buf[start] and ends at buf[stop]. Break it down into tiny projects that are easy one by one. Semantics of C will probably be the big issue if you aren't proficient yet. – RobC Dec 07 '14 at 08:34
  • Yeah, my C isn't very good. So I have the mmap in my code above. How would I get the start and stop positions with that? It would need to work for any number of threads. – Harley Jones Dec 07 '14 at 17:27
  • @HarleyJones as I mentioned above, it's all very well chopping a file into four parts and using `lseek()`, but you'll need a strategy for handling words that straddle each boundary. For example if the word "notice" is broken in its middle, you'll need a way of preventing "not" and "ice" being reported, while "notice" is overlooked. – Weather Vane Dec 07 '14 at 18:04

The idea is that by dividing the work among multiple threads and joining the partial results afterwards, the same operation completes much faster. So you need to:

  1. Divide the work into many parts without wasting too much time
  2. Find a way of joining the partial results back together easily
  3. Solve the problems at the boundaries created by dividing the work

The first part is simple. Just divide your data equally between threads.

The second part is easy too. Just sum the results.

The tricky part is number 3. In your case, you could end up with a word divided between two different threads. To avoid counting "half-words", each thread must keep a separate record of the first and last word of its segment. Then, once you have all the results, you can take the last word of thread N, concatenate it with the first word of thread N+1, and only then add that word to the count. Obviously, if a separator (space, newline, ...) is the first/last character a thread sees, its respective first/last word will be empty.

In pseudo-code:

def main:
    size = filesize
    ptr = mmap(file)
    num_threads = 4

    for i in range(0, num_threads):
        new_thread(exec = count_words,
                   start = ptr + i * size / num_threads,
                   length = size / num_threads)

    wait_for_result(all_threads)
    join_the_results

def count_words(start, length):
    # Count words as if it were an entire file,
    # but store separately the first/last word if
    # the segment does not start/end with a word
    # separator (" ", ".", ",", "\n", etc.)
    return (count_of_words, first_word, last_word)

This is the same idea behind MapReduce.

Governa

This is not perfectly polished code. I have used C++; if you must use C, you can use POSIX threads instead of std::thread. Also, I have simply divided the whole file size by the number of threads, so you will have to handle the last chunk of data (the remainder of that division) in the last thread yourself. I haven't done this.

Another point is how I retrieve the return values from the threads: for now I save them to a global array. C++11 supports retrieving return values directly - see C++: Simple return value from std::thread?

#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
#include <cstring>
using namespace std;

#define NO_OF_THREADS 4

int countArray[100];
std::mutex g_pages_mutex;

int trimWhiteSpaces(char *arr, int start, int len)
{
    int i = 0;
    for(; i < len; i++)
    {
        char c = arr[i];
        if(c == ' ')
        {
            continue;
        }
        else
            break;
    }

    return i;
}

void getWordCount(char *arr, int len, int ID)
{

    int count = 0;
    bool isSpace = false;
    int i = 0;
    i = i + trimWhiteSpaces(arr, i, len);
    for(; i < len; i++)
    {
        char c = arr[i];
        if(c == ' ')
        {
            i = i + trimWhiteSpaces(&arr[i], i, len - i) - 1;  // len - i: don't scan past the chunk
            //printf("Found space");
            isSpace = true;
            count++;
        }
        else
        {
            isSpace = false;
        }
    }

    if(isSpace)
        count = count - 1;
    count = count + 1;
    g_pages_mutex.lock();
    cout << "MYCOUNT:" << count << "\n";
    countArray[ID] = count;
    g_pages_mutex.unlock();
}

int main(int argc, const char * argv[])
{
    // One buffer per thread: a single shared buffer would be overwritten
    // by the next read while an earlier thread is still scanning it.
    static char fileData[NO_OF_THREADS][5000];
    std::thread threadIDs[NO_OF_THREADS];
    const char *filePath = "/Users/abc/Desktop/test.txt";
    int read_sz = 0;
    int decrements = 0;
    bool previousNotEndsInSpace = false;

    std::ifstream is(filePath, std::ifstream::ate | std::ifstream::binary);
    int fileSize = is.tellg();
    int bulkSize = fileSize / NO_OF_THREADS;
    is.seekg(0);

    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        int old_read_sz = read_sz;
        is.read(fileData[iter], bulkSize);
        read_sz = is.tellg();
        int chunk_sz = read_sz - old_read_sz;
        if(chunk_sz > 0)
        {
            fileData[iter][chunk_sz] = '\0';
            cout << " data size so far: " << read_sz << "\n";
            cout << fileData[iter] << endl;
            if(previousNotEndsInSpace && fileData[iter][0] != ' ')
            {
                decrements = decrements + 1;
            }

            // Index relative to this chunk, not to the whole file
            previousNotEndsInSpace = (fileData[iter][chunk_sz - 1] != ' ');
            threadIDs[iter] = std::thread(getWordCount, fileData[iter], chunk_sz, iter);
        }
    }

    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        if(threadIDs[iter].joinable())   // skip threads never started (file shorter than expected)
            threadIDs[iter].join();
    }

    int totalCount = 0;
    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        cout << "COUNT: " << countArray[iter] << "\n";
        totalCount = totalCount + countArray[iter];
    }

    cout << "TOTAL: " << totalCount - decrements << "\n";
    return 0;
}
Seema Kadavan