
I have written this code to find the number of occurrences of a word in a file in C. The code works just fine, but it takes a lot of time. To count the number of occurrences of a word in a file of 650 MB, it takes 151.1 seconds, which is a lot of time. I want to process it at 80 MB/second. How can I improve the time complexity? Many thanks

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
int main(){
    FILE *fptr;
    int l,i=0,count=0,total=0;
    char name[100],n,word[25],k;
    printf("\nEnter the word to be found:");
    scanf("%s",word);
    l=strlen(word);
    printf("\nEnter the file name:");
    scanf("%s",name);
    fptr=fopen(name,"r");
    if(fptr==NULL){
        printf("\nProblem with opening the file");
        exit(1);
    }
    n=fgetc(fptr);
    while((feof(fptr)==0)){
        if(n==toupper(word[i])||n==tolower(word[i])){
            count++;
            i++;
        }
        else if(n!=word[i]){
            if(count>1){
                fseek(fptr, -count, SEEK_CUR);
            }
            count=0;
            i=0;
        }
        if(count==l){
            total++;
            count=0;
            i=0;
        }
        n=fgetc(fptr);
    }
    if(total==0){
        printf("\nThe word %s does not exist in the file",word);
    }
    printf("\nThe word %s occurred %d time(s) in the file",word,total);
}
  • Is it limited by the CPU, or by the I/O? – pascal Jul 08 '20 at 16:18
  • https://stackoverflow.com/questions/12629749/how-does-grep-run-so-fast `the time complexity of a file IO program?` Are you sure you are asking about the _time complexity_ of an algorithm, not just how to do it faster? Time complexity and execution time are, well, related things, but still different. – KamilCuk Jul 08 '20 at 16:18
  • You can't improve the time complexity--it's already O(n), you just want to make it faster. The most straightforward way to do that is to read the file in big chunks--ideally one big chunk--and do your word-finding on those memory buffers. – Lee Daniel Crocker Jul 08 '20 at 16:20
  • @KamilCuk I just want it to be faster – Sharon Shelton Jul 08 '20 at 16:21
  • @LeeDanielCrocker That's probably what I want to do. How can I make it happen? – Sharon Shelton Jul 08 '20 at 16:24

2 Answers


Your program is also likely suffering from a form of I/O amplification where it rereads the same data over and over and over.

This is your main file-reading loop:

n=fgetc(fptr);
while((feof(fptr)==0)){
    if(n==toupper(word[i])||n==tolower(word[i])){
        count++;
        i++;
    }
    else if(n!=word[i]){
        if(count>1){
            fseek(fptr, -count, SEEK_CUR);
        }
        count=0;
        i=0;
    }
    if(count==l){
        total++;
        count=0;
        i=0;
    }
    n=fgetc(fptr);
}

Reducing that to only I/O calls:

n=fgetc(fptr);
while((feof(fptr)==0)){
    if(n!=word[i]){
        if(count>1){
            fseek(fptr, -count, SEEK_CUR);
        }
        count=0;
        i=0;
    }

    n=fgetc(fptr);
}

What is happening:

  1. You open the file in read-only mode
  2. Since the file is buffered, when you first call fgetc() your program actually reads the file from its current offset and fills up its buffer. That means your program can read up to several kB (usually 4kB or 8kB depending on your system) immediately.
  3. Your program loops through a few calls to fgetc() that each return a char value (held in an int) to your code. Most times, that char is simply copied from the buffer associated with fptr.
  4. Your program calls fseek(). That call invalidates the buffered data.
  5. On your next call to fgetc(), your program fills up its buffer again, most of the time rereading data that has already been read.

Depending on how often your program calls fseek(), your program likely reads several hundred to several thousand times more data than is actually contained in the file.

It's not quite as bad as it seems though because most of the reads are hopefully not being read all the way from the disk but are satisfied by your system's page cache. But each one of the fseek() calls results in an extraneous context switch that, along with all the extra calls to read a char at a time by using fgetc(), is likely slowing down your program considerably.

Simply reading large chunks of data with something like fread() will work, but because you "back up" in the data stream (your fseek() calls), you have to account for the possibility of "backing up" into the previous chunk of data.

And that's a bit difficult and tedious to do reliably.

The easiest solution if words don't continue across two lines is to read by line using fgets() (or getline() on POSIX systems):

for (;;)
{
    // define MAX_LINE_LENGTH to a suitable value
    char line[ MAX_LINE_LENGTH ];

    char *result = fgets( line, sizeof( line ), fp );

    // EOF (or error - either way there's no more data to be read)
    if ( result == NULL )
    {
        break;
    }

    // remove newline (if you want)
    line[ strcspn( line, "\n" ) ] = '\0';

    // now process a line of text
        .
        .
        .
}

Reading in lines also allows the use of standard functions such as strtok() to split input into separate words and then the use of strncasecmp() to find case-insensitive matches to the word you're looking for.

Andrew Henle

Read a larger buffer at once. fgetc() reads a single byte at a time, which is the smallest possible amount you can read, so you are maximizing the number of read operations required to get through the file. Each read operation has some overhead. (Each fgetc() call won't necessarily result in an actual read from the disk -- there is some caching and read-ahead happening behind the scenes.) So the fewer calls you make, the less work the program has to do to process the same amount of data.

Technically, reading in larger batches won't decrease the "time complexity." It'll still be roughly linear in terms of file size, so it's the same category of complexity. It'll just be a whole lot faster, which is what you actually care about.
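A rough sketch of that chunked approach (my own illustration, with an arbitrary 64 KiB chunk size, assuming a non-empty search word): carry the last wordlen-1 bytes of each chunk into the next scan so a match that straddles a chunk boundary isn't missed. It counts case-insensitive substring occurrences:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Sketch: count case-insensitive substring occurrences of `word`
   by reading the stream in large chunks. Assumes word is non-empty. */
static long count_chunked(FILE *fp, const char *word)
{
    enum { CHUNK = 1 << 16 };             /* 64 KiB per fread() */
    size_t wlen = strlen(word);
    char *buf = malloc(CHUNK + wlen);     /* room for carried bytes */
    size_t carry = 0;                     /* bytes held over from last chunk */
    long total = 0;
    size_t got;

    if (buf == NULL)
        return -1;

    while ((got = fread(buf + carry, 1, CHUNK, fp)) > 0)
    {
        size_t len = carry + got;

        /* scan every position where a full match could fit */
        for (size_t i = 0; i + wlen <= len; i++)
        {
            size_t j = 0;
            while (j < wlen &&
                   tolower((unsigned char)buf[i + j]) ==
                       tolower((unsigned char)word[j]))
                j++;
            if (j == wlen)
                total++;
        }

        /* keep the last wlen-1 bytes: a match crossing into the next
           chunk must start inside this tail, and a match shorter than
           wlen can't fit entirely inside it, so nothing is double-counted */
        carry = (len >= wlen - 1) ? wlen - 1 : len;
        memmove(buf, buf + len - carry, carry);
    }
    free(buf);
    return total;
}
```

The naive inner comparison loop is enough to demonstrate the I/O pattern; a real implementation might lowercase the word once up front or use a smarter string-search algorithm.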

Also, I know you are just showing short sample code for the sake of the question, but you are reading into the fixed-size buffers word and name with unbounded scanf() calls. Since word is only 25 bytes long (24 characters plus the terminating '\0'), a user who enters a longer word can potentially crash or exploit your program.

wrosecrans