2

I'm writing a program in C that processes a text file and keeps track of each unique word (using a struct that holds a char array for the word and a count of its occurrences), storing these structs in a data structure. However, the assignment includes this requirement: "The entire txt file may be very large and not able to be held in the main memory. Account for this in your program."

I asked my professor after class, and he said to read the text file X lines at a time (I think 20,000 was his suggestion?), analyze them and update the structs, and repeat until you've reached the end of the file.

Can anyone help explain the best way to do this and tell me what functions to use? I'm very, very new to C.

(My current program is accurate and correct for small files; I just need to make it accommodate enormous files.)

Thank you so much!!

EDIT:

    fp = fopen(argv[w], "r");
    if (fp == NULL) {
        fprintf(stderr, "Input file %s cannot be opened.\n", argv[w]);
        return 2;
    }

    /* other parts of my program here */

    char s[MaxWordSize];

    while (fscanf(fp, "%s", s) != EOF) {
        nonAlphabeticDelete(s);  // removes non-letter characters

        toLowerCase(s);          // converts the string to lowercase

        // attempts to add to the data structure
        pthread_mutex_lock(&lock);
        add(words, &q, s);
        pthread_mutex_unlock(&lock);
    }

This works; I just need to adjust it to go through the text file X lines at a time.

balloon
  • 49
  • 2
  • 7
  • 1
    Share some of your code, so we could give you better answers... – eyalm Dec 04 '15 at 05:00
  • It sounds like the teacher wants you to allocate a 20k block of memory and use `fread` to read that many bytes at once, then process that block. – 001 Dec 04 '15 at 05:04
  • 1
    This looks like you're not reading the entire file to memory anyway? I don't think you'd have a problem with large text files. – maxton Dec 04 '15 at 05:13
  • @maxton There's a heavy language barrier between me and my professor, that's what he gave me, so this is literally all the information I have to go off of... :( – balloon Dec 04 '15 at 05:15
  • 1
    I think your prof might have assumed you'd be reading the entire file to memory, then parsing it. But you're just reading one word at a time from disk. Reading 20k lines and then parsing them would make your program less memory efficient! – maxton Dec 04 '15 at 05:19
  • @maxton you just made my life a lot less stressful... thank you~! – balloon Dec 04 '15 at 05:25
  • Unless you are working on tiny embedded systems (which are seldom called upon to do text processing), you're unlikely to run into problems. Even a big book like the Bible or The Complete Works of Shakespeare will only have a few thousand distinct words (say less than 100k distinct words), and the average word length is likely under 12, so with 16 bytes (12 for the word and 4 for the count) on average and 100k distinct words (I'm pretty sure that's a significant, if not gross, over-estimate), you would be using less than 2 MiB of memory. Few modern systems would find that stressful. – Jonathan Leffler Dec 04 '15 at 06:49
  • The other probable reason for the suggestion is to show you the dramatic improvement in program efficiency you can achieve by reading fewer times and parsing the information you need while the block of text is in memory. *Disc IO* is just about the slowest bottleneck on modern computers still using spinning hard discs. You are currently reading one *word* at a time, and that may necessitate 10 - 20 disc read operations for every line. If you read `20k` lines at a time, you would eliminate up to `400k` read operations per chunk of memory - that is one heck of an improvement. – David C. Rankin Dec 04 '15 at 08:42
  • The input stream, associated with a _text file_, is _fully buffered_; there are not _10 - 20 disc read operations for every line_ unless every line would have 10 - 20 times the size of the buffer. – Armali Aug 11 '17 at 13:37

3 Answers

1

How about getline()? Here is an example from the man page: http://man7.org/linux/man-pages/man3/getline.3.html

   #define _GNU_SOURCE
   #include <stdio.h>
   #include <stdlib.h>

   int
   main(void)
   {
       FILE *stream;
       char *line = NULL;
       size_t len = 0;
       ssize_t read;

       stream = fopen("/etc/motd", "r");
       if (stream == NULL)
           exit(EXIT_FAILURE);

       while ((read = getline(&line, &len, stream)) != -1) {
           printf("Retrieved line of length %zu :\n", read);
           printf("%s", line);
       }

       free(line);
       fclose(stream);
       exit(EXIT_SUCCESS);
   }
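
For the word counting itself, here is a minimal sketch of how the reading loop could be wired up with getline(). The `count_word()` function and the inline letter filtering are only stand-ins for your own `add(words, &q, s)`, `nonAlphabeticDelete()` and `toLowerCase()`. Only one line is ever held in memory, however large the file is.

    #define _GNU_SOURCE              /* for getline() on glibc */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* stand-in for the question's add(words, &q, s) */
    static void count_word(const char *w) { printf("word: %s\n", w); }

    static void process_stream(FILE *fp)
    {
        char *line = NULL;
        size_t cap = 0;

        /* getline() grows its buffer as needed; one line at a time in memory */
        while (getline(&line, &cap, fp) != -1) {
            for (char *tok = strtok(line, " \t\r\n"); tok != NULL;
                 tok = strtok(NULL, " \t\r\n")) {
                /* keep letters only and fold to lower case, much like
                   nonAlphabeticDelete() + toLowerCase() in the question */
                char word[64];
                size_t n = 0;
                for (const char *p = tok; *p && n + 1 < sizeof word; p++)
                    if (isalpha((unsigned char)*p))
                        word[n++] = (char)tolower((unsigned char)*p);
                word[n] = '\0';
                if (n > 0)
                    count_word(word);
            }
        }
        free(line);                  /* free getline's buffer once at the end */
    }

    int main(void)
    {
        process_stream(stdin);       /* e.g. ./a.out < big.txt */
        return 0;
    }
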
Magnus Lutz
  • 558
  • 3
  • 12
  • `getline` is a fine function, but it suffers from the same limitation as `fgets` here -- you can read at most one line at a time before it will return. It has a partner named `getdelim` that would allow reading multiple lines, or perhaps the whole document, depending on the delimiter. Another alternative is `fread`. – David C. Rankin Dec 04 '15 at 09:30
  • It is absolutely sufficient to read just one line, since the aim is only to overcome the limitation "_The entire txt file may be very large and not able to be held in the main memory_". It is not even necessary to read a whole line - the original `fscanf(fp,"%s",s)` is already fine. _stdio_ buffering takes adequate care of performance. – Armali Aug 11 '17 at 13:21
0

This is best done by reading some manuals, but I can provide a head start.

    FILE *fp;
    fp = fopen("fileToRead.txt", "rb");
    if (!fp) { /* handle failure! */ }
    #define GUESS_FOR_LINE_LENGTH 80
    char sentinel = '\0';
    while ((sentinel = getc(fp)) != EOF)
    {
        ungetc(sentinel, fp);
        char buffer[20000*GUESS_FOR_LINE_LENGTH];
        size_t numRead = fread(buffer, 1, 20000*GUESS_FOR_LINE_LENGTH, fp);
        if (numRead < 20000*GUESS_FOR_LINE_LENGTH) { /* last run */ }
        /* now buffer has numRead characters */
        size_t lastLine = numRead - 1;
        while (buffer[lastLine] != '\n') { --lastLine; }
        /* process up to lastLine */
        /* copy the remainder from lastLine to the front */
        /* and fill the remainder from the file */
    }

This is really more like pseudo-code. Since you mostly have a working program, you should use this as a guideline.
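
Filling in the guideline above, a rough but self-contained sketch of the chunked fread approach could look like this. The buffer lives on the heap, each chunk is processed only up to its last newline, and the leftover bytes are moved to the front before the next read. `process_words()` is just a stub here (it only counts whitespace-separated words); in the real program it would be the scan that updates your word structs.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK_SIZE (1 << 20)            /* 1 MiB per read; tune as needed */

    static unsigned long total_words;

    /* stub scanner: counts whitespace-separated words in buf[0..len) */
    static void process_words(const char *buf, size_t len)
    {
        int in_word = 0;
        for (size_t i = 0; i < len; i++) {
            if (isspace((unsigned char)buf[i]))
                in_word = 0;
            else if (!in_word) { in_word = 1; total_words++; }
        }
    }

    static int count_file(const char *path)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return -1;

        char *buf = malloc(CHUNK_SIZE);
        if (!buf) { fclose(fp); return -1; }

        size_t kept = 0;                    /* bytes carried over from the last chunk */
        size_t nread;
        while ((nread = fread(buf + kept, 1, CHUNK_SIZE - kept, fp)) > 0) {
            size_t len = kept + nread;

            /* only process up to the last newline so no line is split */
            size_t cut = len;
            while (cut > 0 && buf[cut - 1] != '\n')
                cut--;
            if (cut == 0)                   /* one line longer than the whole chunk */
                cut = len;

            process_words(buf, cut);

            kept = len - cut;               /* move the unprocessed tail to the front */
            memmove(buf, buf + cut, kept);
        }
        if (kept > 0)
            process_words(buf, kept);       /* whatever is left at end of file */

        free(buf);
        fclose(fp);
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc > 1 && count_file(argv[1]) == 0)
            printf("%lu words\n", total_words);
        return 0;
    }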

Aaditya Kalsi
  • 1,039
  • 8
  • 16
  • 1
    Don't use `feof()`. Read this: http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong – chqrlie Dec 04 '15 at 05:07
  • undefined behavior on end of file (`numRead == 0`) – chqrlie Dec 04 '15 at 05:10
  • Does not work correctly, if at all: make `sentinel` an `int` variable. Allocating megabytes on the stack is not a good idea; allocate the buffer with `malloc`, or better, use `mmap`. – chqrlie Dec 04 '15 at 05:29
  • Shouldn't sentinel be an int so that it can properly store EOF? –  Dec 04 '15 at 05:31
  • It is a good idea to open the file in binary mode. You should also read chunks that are a multiple of the system page size: instead of `20000`, use `16384` or some higher power of 2, and allocate the buffer with `malloc` with some extra slack to handle the partial line. – chqrlie Dec 04 '15 at 05:37
0

First try reading one line at a time. Scan the line buffer for word boundaries and fine-tune the word counting part. Using a hash table to store the words and counts seems a good approach. Make the output optional, so you can measure read/parse/lookup performance.
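
For the hash table, a bare-bones sketch could be a fixed array of chained buckets; the names (`count_word`, `N_BUCKETS`) and the FNV-1a hash below are illustrative choices, not requirements:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N_BUCKETS 65536
    #define MAX_WORD  64

    /* one entry per distinct word: the word and its count */
    struct entry {
        struct entry *next;
        unsigned long count;
        char word[MAX_WORD];
    };

    static struct entry *buckets[N_BUCKETS];

    /* FNV-1a, a simple and reasonable hash for short strings */
    static unsigned hash(const char *s)
    {
        unsigned h = 2166136261u;
        while (*s) {
            h ^= (unsigned char)*s++;
            h *= 16777619u;
        }
        return h % N_BUCKETS;
    }

    /* bump the count for w, inserting it on first sight */
    static void count_word(const char *w)
    {
        unsigned h = hash(w);
        for (struct entry *e = buckets[h]; e; e = e->next)
            if (strcmp(e->word, w) == 0) { e->count++; return; }

        struct entry *e = malloc(sizeof *e);
        if (!e) { perror("malloc"); exit(1); }
        snprintf(e->word, sizeof e->word, "%s", w);
        e->count = 1;
        e->next = buckets[h];
        buckets[h] = e;
    }

    int main(void)
    {
        char w[MAX_WORD];
        while (scanf("%63s", w) == 1)       /* crude tokenizer for the demo */
            count_word(w);

        for (size_t i = 0; i < N_BUCKETS; i++)
            for (struct entry *e = buckets[i]; e; e = e->next)
                printf("%s\t%lu\n", e->word, e->count);
        return 0;
    }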

Then make another program that uses the same algorithm for the core part but uses mmap to read sizeable parts of the file and scan the block of memory. The tricky part is handling the block boundaries.
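
Here is a minimal sketch of the mmap variant, mapping the whole file in one go; on a 64-bit system the kernel pages it in on demand, so physical memory use stays bounded. `scan_words()` is again just a stub standing in for the shared word scanner. For a windowed variant (e.g. on 32-bit systems) you would map page-aligned slices via the offset argument and carry the partial last line over, just as with fread.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* stub standing in for the shared word scanner: here it counts newlines */
    static void scan_words(const char *p, size_t len)
    {
        size_t lines = 0;
        for (size_t i = 0; i < len; i++)
            if (p[i] == '\n')
                lines++;
        printf("%zu lines scanned\n", lines);
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
            return 1;

        /* map the file read-only; pages are loaded lazily by the kernel */
        char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        scan_words(p, (size_t)st.st_size);

        munmap(p, (size_t)st.st_size);
        close(fd);
        return 0;
    }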

Compare output from both programs on a set of huge files, ensure the counts are identical. You can create huge files by concatenating the same file many times.

Compare timings too. Use the `time` command-line utility. Disable output for this benchmark to focus on the read/parse/analysis part.

Compare the timings with other programs such as `wc` or `cat - > /dev/null`. Once you get similar performance, the bottleneck is the speed of reading from mass storage, and there is not much room left for improvement.

EDIT: looking at your code, I have these remarks:

  • `fscanf` is probably not the right tool: at least you should protect against buffer overflow. How should you handle `foo,bar`: one word or two?

  • I would suggest using `fgets()` or `fread` and moving a pointer along the buffer, skipping the non-word bytes and converting the word bytes to lower case with an indirection through a 256-byte array, avoiding copies (see the sketch after this list).

  • Make the locking stuff optional via a preprocessor variable. It is not needed if the `words` structure is only accessed by a single thread.

  • How did you implement `add`? What is `q`?
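
As a sketch of the fgets()-plus-lookup-table idea from the list above: `record_word()` is only a placeholder for the hash-table update, and the table and buffer sizes are arbitrary. The `fold[]` table classifies and lower-cases a byte in a single lookup, and the words are folded in place in the line buffer, so nothing is copied.

    #include <stdio.h>

    /* placeholder for the hash table update (your add() call) */
    static void record_word(const char *w) { printf("word: %s\n", w); }

    /* fold[c] is the lower-case letter for letter bytes, 0 otherwise:
       one lookup both classifies the byte and converts its case */
    static unsigned char fold[256];

    static void init_fold(void)
    {
        for (int c = 'a'; c <= 'z'; c++) fold[c] = (unsigned char)c;
        for (int c = 'A'; c <= 'Z'; c++) fold[c] = (unsigned char)(c - 'A' + 'a');
    }

    /* move a pointer along the buffer: fold each word in place,
       terminate it temporarily, hand it out, then restore the byte */
    static void scan_line(char *line)
    {
        unsigned char *p = (unsigned char *)line;
        while (*p) {
            while (*p && !fold[*p])            /* skip non-word bytes */
                p++;
            if (!*p)
                break;
            unsigned char *start = p;
            while (fold[*p]) {                 /* fold letters in place */
                *p = fold[*p];
                p++;
            }
            unsigned char saved = *p;
            *p = '\0';
            record_word((char *)start);
            *p = saved;
        }
    }

    int main(void)
    {
        char line[4096];                       /* a longer line would split a word */
        init_fold();
        while (fgets(line, sizeof line, stdin))
            scan_line(line);
        return 0;
    }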

chqrlie
  • 131,814
  • 10
  • 121
  • 189