Print out the occurence of each word in a given text file

Question

We are given a text file with some text and we are required to print out the occurence of each word. If the same word is written in capital or not doesnt matter for example: "immenso" and "IMMENSO" are considered the same word

I kinda have done it but not really, in the sense that it prints out more than i want. Thing is i cant figure out why its not working.

The text is:

Mi illumino di immenso
Illumino di immenso
Di immenso
IMMENSO

And the result should have been

immenso 4
di 3
illumino 2
Mi 1

This is the code i have made so far:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#define N 100

typedef struct words {
    int iteration;
    char word[N];
} word_s;

int
main()
{
    FILE *file;

    file = fopen("file", "r");

    word_s v[N];

    if (file == NULL) {
        printf("Error");
        exit(1);
    }
    int i = 0,
        j = 1,
        n = 0;
    char str[N];

    while (fscanf(file, "%s", str) != EOF) {
        if (n == 0) {
            strcpy(v[n].word, str);
            v[n].iteration = 1;
        }

        for (i = 0; i < n; i++) {
            if (stricmp(v[i].word, str) != 0) {
                strcpy(v[n].word, str);
                v[n].iteration = 1;

            }
            if (stricmp(v[i].word, str) == 0) {
                v[i].iteration++;
            }

        }

        n++;
    }
    for (i = 0; i < n; i++) {
        printf("word %s: %d iterations \n", v[i].word, v[i].iteration);
    }

    return 0;
}

Basically what i tried to do in the while loop is that with each new word we get from the fscanf(...,"%s",...); we compare it the the already stored words and if it doesnt exist in the stored data structure we add it, else if we find it already stored we increment the counter.

Originally i wanted to add the n++ into this part

if(stricmp(v[i].word,str)!=0)
        {
            strcpy(v[n].word,str);
            v[n].iteration=1;
            n++
        }

but it wasnt working

What i get as a result is:

Mi 1
illumino 2
di 3
immenso 4
Illumino 1
di 2
immenso 3
Di 1
immenso 2
IMMENSO 1

So basically it has stored every word and counted from that position on.

Any idea?

The first `if` condition is replacing every word that doesn't match `str` with `str`. You should only add `str` to the array if you get to the end of the loop without finding a match. — Barmar, Apr 06 '23 at 01:25
If you do that, there's no need to treat `n==0` as a special case. When the array is empty, you'll get to the end immediately. — Barmar, Apr 06 '23 at 01:25
`stricmp` (aka `strcasecmp`) is slow because it has to do an extra table lookup. Better to lowercase the strings as you read them (after the `fscanf`): `for (char *cp = str; *cp != 0; ++cp) *cp = tolower((unsigned char) *cp);` Then, you can just use `strcmp` for comparisons. For an example of how to do a dictionary, see my cs50 speller answer: [cs50 pset5 Speller optimisation](https://stackoverflow.com/a/71316468/5382650) — Craig Estey, Apr 06 '23 at 01:35
@Barmar I tried to do something like this: In the if statements in the for loop i put a flag called found. So if there are no matches until the end of the current loop found is set to 0 otherwise there are matches found=1 and then i break the loop. In a separate if statement outside the for loop i put *if(found==1) add word and set incrementation to 1* otherwise *if(found==0) increment++*. So far i managed to actually print out the "unique" words but for some reason the incrementation doesnt work but thats something so thank you — Severjan Lici, Apr 06 '23 at 01:44
@SeverjanLici If you think of every word as a _key_ that you lookup in a dictionary where you store the number of times you've encountered that _key_, does that give any inspiration? If you don't want to go into hashing, a simple [binary tree](https://en.wikipedia.org/wiki/Binary_tree) will do. — Ted Lyngmo, Apr 06 '23 at 01:46
@TedLyngmo Unfortunately im currently unfamiliar with binary trees — Severjan Lici, Apr 06 '23 at 01:51
@SeverjanLici The link explains them though. A word lexigographically "less" (`strcmp(a,b) < 0`) than another word is stored in the left branch, otherwise the right etc. — Ted Lyngmo, Apr 06 '23 at 01:52
The other option is a simplified implementation of a hash table, where you simply take a hash of the word and store the word at that bucket (index) in your array of pointers. If a collision occurs (2 strings hash to the same index -- unlikely) you can just link the 2nd word to the first. A very fast and efficient hash functions for words is [openssl - lh_strhash](https://docs.huihoo.com/doxygen/openssl/1.0.1c/include_2openssl_2lhash_8h.html#a7bac1f37e4c347fea2158c09882f273f) which has no collisions hashing the 239,000 words in the English `/usr/share/dict/words` file. — David C. Rankin, Apr 06 '23 at 02:09
@DavidC.Rankin Again unfortunately im unfamiliar with the more complex strategies of programming. I just recently studied the linked lists and this exercise is required to use those but i just wanted to do it this way first to get a simpler idea on how the exercise works! — Severjan Lici, Apr 06 '23 at 02:14
_"I just recently studied the linked lists and this exercise is required to use those"_ - Are you sure? A binary tree and a linked list share many commonalities. If you do draw a binary tree with a pen and paper and have a linked list in mind you should see some common things. — Ted Lyngmo, Apr 06 '23 at 02:50
@SeverjanLici - nothing wrong with a brute-force approach for small input sets (100,000 words of less or so -- no real hard limit, but as you get toward 1,000,000 the loop traversals begin the add significant overhead) Short answer provided which you can adapt as needed. — David C. Rankin, Apr 06 '23 at 02:54

David C. Rankin · Accepted Answer · 2023-04-07T01:04:02.637

Alright, let's look at one way you can brute-force your way to a collection of unique words without the use of a binary-tree or hash-table, just using a good old for loop to check all existing words.

As mentioned in the comments above, you do not need a special case for the 1^st word. Simply using a for loop and looping for (i=0; i<n; i++) will skip the loop entirely when n is zero.

In your code above, you have another logic error pertaining to when n is incremented. Since you are concerned with duplicate words, you only want to increment n (the number of words in your array) when you add a word, not on every iteration.

There is another optional consideration not specified in your question. Since you are only concerned with a case-insensitive comparison, do you need to print out the word in its original case before conversion to lower-case? A second array holding the original is a convenient way to preserve it. You can either use a struct to hold the str and lowercase_str (and use an array of those), or you can simply use two 2D arrays where the common index coordinates words from each. (an array of struct is the more elegant solution)

(Note: a 100 x 100 character array is just 10k of storage - trivial with a 1M stack minimum default on Windows, more trivial with the default 4M stack on Linux. You could have roughly 100 of them on windows...)

As also mentioned in the comments, stricmp() (e.g. strcasecmp()) isn't the speediest of functions, and is not Standard C (it's POSIX). You can simply use a loop to convert to lower-case using tolower() and ensure portability. (though virtually all compilers also support the POSIX functions -- with a proper define)

That said, you can change your code similar to the following to obtain the unique words in your file. Note: you should take the filename as the first argument to your program (or read from stdin by default). Another alternative is to prompt the user and read the filename as input. You should not need to recompile your code just to read from a different input file.

Putting all the pieces together, you can do something similar to the following (with the optional 2nd words array preserving the original case for output), e.g.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define N 100

/* provide filename as 1st argument or read stdin by default */
int main (int argc, char **argv)
{
  int i = 0,
      n = 0,
      maxlen = 0;     /* keep longest word length for formatting output */
  char str[N],        /* buffer to hold string from file */
       lcstr[N],      /* buffer to hold lower-case string */
       words[N][N],   /* 2D array to hold 100 words of 99 char */
       lcwords[N][N]; /* 2D array to hold 100 lowercase words of 99 char */
  /* use filename provided as 1st argument (stdin by default) */
  FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

  if (!fp) {  /* validate file open for reading */
    perror ("file open failed");
    return 1;
  }
  
  /* loop continually reading word up to 99-chars from file.
   * You cannot use "%s" to fill an array without protecting the array
   * bounds by using the field-width modifier. Protect your words arrays
   * by checking n < N.
   */
  while (n < N && fscanf (fp, "%99s", str) == 1) {
    int found = 0;    /* found flag */
    /* convert to lowercase in lcstr */
    for (i = 0; str[i]; i++) {
      lcstr[i] = tolower((unsigned char)str[i]);
    }
    lcstr[i] = 0;     /* don't forget to nul-terminate lcstr */
    if (i > maxlen) { /* check and save maxlen */
      maxlen = i;
    }
    /* loop checking if word exist in words */
    for (i = 0; i < n; i++) {
      if (strcmp(lcwords[i], lcstr) == 0) {
        found = 1;
        break;
      }
    }
    /* add word to words and lcwords array if not found.
     * only increment n when adding a word to array.
     */
    if (!found) {
      strcpy (words[n], str);
      strcpy (lcwords[n++], lcstr);
    }
  }
  if (fp != stdin)   /* close file if not stdin */
    fclose (fp);
  
  puts ("\nunique words:");
  for (i = 0; i < n; i++) {
    printf ("  %-*s    lowercase: %s\n", maxlen, words[i], lcwords[i]);
  }
}

Two big notes:

(1) you have nothing that needs #include <stdlib.h>. Only include headers that you need. When you use a function -- look it up in the man-page (e.g. at man7.org - Online Man Pages. The man page for each function (while cryptic at first) provides concise information on how to use each function -- including the header(s) required;

(2) you cannot use "%s" alone when reading characters into an array. You must use the field-width modifier to protect from writing beyond the array bounds. In your case with a max of 100-characters, you must limit the read by fscanf() to 99 characters with "%99s" (to ensure 1-char of storage remains for the nul-terminating character '\0' (equivalent to plain-old 0).

Example Use/Output

With your data in the file dat/uniquewords.txt, you could do:

$ ./bin/unique_words dat/uniquewords.txt

unique words:
  Mi          lowercase: mi
  illumino    lowercase: illumino
  di          lowercase: di
  immenso     lowercase: immenso

Which outputs the unique words in their original case preserved in the words array. The array of lowercase words lcwords is used for comparison for duplicates. Also note how the found flag is used to only add words not found (you could use i == n as well -- up to you)

Let me know if you have questions.

Using Array of struct To Add Occurrence Count

Per your comment indicating you need to capture the occurrence count, you can do so by simply adding an else to the if (!found) block. Essentially, your logic is:

    /* add word to words and lcwords array if not found.
     * only increment n when adding a word to array.
     */
    if (!found) {
      words[n] = tmp;         /* assign temp struct to words */
      words[n++].count = 1;   /* set count at 1 */
    }
    else {  /* otherwise */
      words[i].count += 1;    /* increment occurrence count at i */
    }

Rewriting to use an array of struct (optionally holding the original form of the 1st occurrence encountered) and lowercase form, you could rearrange as follows, adding the occurrence count as well:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define N 100

typedef struct {    /* struct holding word, lowercase word and count */
  char word[N],
       lcword[N];
  int count;
} word_t;

/* provide filename as 1st argument or read stdin by default */
int main (int argc, char **argv)
{
  int i = 0,
      n = 0,
      maxlen = 0;     /* keep longest word length for formatting output */
  char str[N];        /* buffer to hold string read from file */
  word_t words[N];    /* array of struct to hold 100 words of 99 char */
  /* use filename provided as 1st argument (stdin by default) */
  FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

  if (!fp) {  /* validate file open for reading */
    perror ("file open failed");
    return 1;
  }
  
  /* loop continually reading word up to 99-chars from file.
   * You cannot use "%s" to fill an array without protecting the array
   * bounds by using the field-width modifier. Protect your words arrays
   * by checking n < N.
   */
  while (n < N && fscanf (fp, "%99s", str) == 1) {
    int found = 0;                /* found flag */
    word_t tmp = { .word = "" };  /* temporary struct to fill */
    /* fill temp struct, convert to lowercase in lcword */
    for (i = 0; str[i]; i++) {
      tmp.word[i] = str[i];
      tmp.lcword[i] = tolower((unsigned char)str[i]);
    }
    tmp.word[i] = 0;    /* don't forget to nul-terminate lcstr */
    tmp.lcword[i] = 0;
    
    if (i > maxlen) {   /* check and save maxlen */
      maxlen = i;
    }
    
    /* loop checking if word exist in words */
    for (i = 0; i < n; i++) {
      if (strcmp (words[i].lcword, tmp.lcword) == 0) {
        found = 1;
        break;
      }
    }
    /* add word to words and lcwords array if not found.
     * only increment n when adding a word to array.
     */
    if (!found) {
      words[n] = tmp;         /* assign temp struct to words */
      words[n++].count = 1;   /* set count at 1 */
    }
    else {  /* otherwise */
      words[i].count += 1;    /* increment occurrence count at i */
    }
  }
  
  if (fp != stdin)   /* close file if not stdin */
    fclose (fp);
  
  puts ("\nunique words:");   /* output unique words */
  for (i = 0; i < n; i++) {
    printf ("  %-*s    count: %2d    lowercase: %s\n", 
            maxlen, words[i].word, words[i].count, words[i].lcword);
  }
}

Example Use/Output

Using the same datafile, the output would now be:

$ ./bin/unique_words_struct dat/uniquewords.txt

unique words:
  Mi          count:  1    lowercase: mi
  illumino    count:  2    lowercase: illumino
  di          count:  3    lowercase: di
  immenso     count:  4    lowercase: immenso

Let me know if that will do it for you.

First of all thank you for the detailed and commented answer. I did actually manage to find the "unique words" and print them. The reason i defined a structure though was so that i could also find out the occurrence of each word. What i did was , like the code above used a *found* flag, and i do a for loop inside the while(fscanf) statement that says *for i up to the current n* check to see if there are matches. if there arent found=0 else if there are matches found=1 and break. Then outside the loop i did: "if found==0" then copy the word into the nth position and inc n else just inc the occ — Severjan Lici, Apr 06 '23 at 12:07
Unfortunately while i did mange to print out the "unique" words. Their occurence doesnt seem to increase. — Severjan Lici, Apr 06 '23 at 12:07
Do you need to capture the occurrence as well? If so, just add an `else` statement to `if (!found) { ... /* occurrences = 1 */ } else { /* increase occurrence count */ }` Let me know if you are stuck and I'll add an update. — David C. Rankin, Apr 07 '23 at 00:51
Thank you again for posting a detailed answer and for trying to implement it my way. — Severjan Lici, Apr 08 '23 at 00:07
Glad to help. The only real difference between the two versions is the storage for the data. The remainder of the logic is the same. You can take it one step further and store each unique *case* occurrence in an array of pointers (or pointer to pointer) in the struct. Each time you have a unique case combination, you just add that word to your collection. Once you learn more about data structures, you can use a linked-list for the unique-case strings and eliminate tracking the current number you have, etc... Good luck with your coding! — David C. Rankin, Apr 08 '23 at 01:17

Print out the occurence of each word in a given text file

1 Answers1