How to read a file and print the number of different words in C

Question

So the first part of this code is reading any random text file and printing the total number of words in it, which I understand, but for the second part (the ?????? part) the number of different words must be printed. Not the number of unique words, which are words that only occur once, but different words, which are the unique words plus one of each repeating word.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX 80

typedef char string[MAX+1];

int main(void)
{
    char file[MAX], s[MAX];
    int count = 0, i, j;
    FILE *inFile;

    printf("Input file name: ");
    scanf("%79s", file);
    inFile = fopen(file,"r");

    if (inFile == NULL)
    {
        printf("\n\nFile does not exist or cannot be opened.\n");
        exit(1);
    }
    while (fgets(s, MAX, inFile) != NULL)
    {
        for (i = 0; s[i] != '\0'; i++)
        {
            if (s[i] == ' ')
                count++;
        }
    }


    int total= count + 1;
    printf("The total number of words in the file is: %d\n", total);

    ?
    ?
    ?
    ?
    ?
    ?
    ?

    fclose(inFile);
    int different = ?
    printf("The total number of different words in the file is: %d\n", different);
    *
    *
    *

How do I go about counting and printing this?

You need to keep track of all the words that you've read so far. Create a structure that stores every word with a counter, and when you read a new word, look up whether it has already been read. If it has, increment its counter by one; otherwise add the word to the dictionary with a counter of 1. Easy peasy. Now that I've given you the idea, try to implement it yourself. – Pablo Apr 24 '18 at 00:36
`if (s[i] == ' ') count++;` What happens if the file contains `" one \n"` or `" one two \n"`? – David C. Rankin Apr 24 '18 at 02:02
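
To illustrate the point of the last comment: counting spaces miscounts lines with leading, trailing, or repeated whitespace. One common alternative is to count transitions from whitespace to non-whitespace. The following is only a minimal sketch; `count_words_in_line` is a hypothetical helper name, and it assumes that anything `isspace` accepts separates words.

```c
#include <ctype.h>

/* Counts the words in s by counting each transition from
 * whitespace (or start of string) into a non-whitespace run. */
int count_words_in_line(const char *s)
{
    int count = 0;
    int in_word = 0; /* 1 while we are inside a word */

    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            count++; /* first character of a new word */
        }
    }
    return count;
}
```

For a file containing only `" one \n"`, the space-counting loop in the question reports 3 words, while this helper returns 1 for that line. One caveat: with an 80-byte `fgets` buffer, a longer line can be split in the middle of a word, so a per-line helper like this may still count that word twice.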

score 0 · Answer 1 · answered Apr 24 '18 at 00:38

Use the concept of a hash set (like Java's `HashSet`).
Put each word you read into the set. A set stores each value only once, so the number of entries in the set at the end is the number of different words.
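
C has no built-in hash set, so the idea has to be built by hand. The following is a minimal sketch, assuming a fixed number of buckets with separate chaining; `word_set`, `set_add`, `set_free`, and `BUCKETS` are hypothetical names chosen for illustration, not part of any library.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 1024 /* hypothetical table size, fine for small files */

struct node {
    char *word;
    struct node *next;
};

struct word_set {
    struct node *bucket[BUCKETS];
    size_t size; /* number of different words stored so far */
};

/* djb2-style string hash, reduced to a bucket index */
unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % BUCKETS;
}

/* Returns 1 if word was new, 0 if already present, -1 if out of memory */
int set_add(struct word_set *set, const char *word)
{
    unsigned long b = hash_word(word);
    struct node *n;

    for (n = set->bucket[b]; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0)
            return 0; /* seen before: the set does not grow */

    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->word = malloc(strlen(word) + 1);
    if (n->word == NULL) {
        free(n);
        return -1;
    }
    strcpy(n->word, word);
    n->next = set->bucket[b]; /* push onto the front of the chain */
    set->bucket[b] = n;
    set->size++;
    return 1;
}

/* Releases every stored word and resets the set to empty */
void set_free(struct word_set *set)
{
    for (size_t b = 0; b < BUCKETS; b++) {
        struct node *n = set->bucket[b];
        while (n != NULL) {
            struct node *next = n->next;
            free(n->word);
            free(n);
            n = next;
        }
        set->bucket[b] = NULL;
    }
    set->size = 0;
}
```

Start from a zero-initialized set (`struct word_set set = {0};`), call `set_add` once per word read from the file, and read `set.size` at the end as the number of different words.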

Naman Mehta

This would at best be a comment (when you reach a Rep. of 100), it is not an answer to the question and poses a hashtable solution that is well beyond the scope of the assignment at hand. I won't ding you this time, but avoid posting comments as answers. – David C. Rankin Apr 24 '18 at 03:47

score -1 · Answer 2 · answered Apr 24 '18 at 01:00

You have to read the file word-by-word. See this answer.

As you read the words, you need to store them in an array.

After that's done, the size of the array (the number of non-empty spots) is the total number of words.

The number of different words is a little trickier. Use a nested loop to iterate over the array one word at a time and, using `strcmp`, compare each word with the other words in the array to count how many times it appears. You also have to watch out for duplicates, so that a repeated word is not counted more than once.

Example of this last part:

char **words = ...; /* Assuming you have read the words into this */
int word, number_of_words = ...; /* Assuming you have the number of words */

for (word = 0; word < number_of_words; word++) {
    int i = word + 1;
    unsigned wc = 1; /* start at 1 so that words[word] itself is counted */

    /* Compare against every later word in the array */
    while (i < number_of_words) {
        if (strcmp(words[i], words[word]) == 0) {
            wc++;
        }
        i++;
    }
    printf("Count of \"%s\" is: %u\n", words[word], wc);
}

The example above prints a line for every position in the array, so a repeated word is reported again each time it reappears (with a smaller count). You have to handle those duplicates yourself.
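
One possible way to handle the duplicates, sketched under the same assumption that the words are already stored in an array: count a word only when no earlier element of the array equals it. `count_different` is a hypothetical helper name, not something from the question.

```c
#include <string.h>

/* Returns how many different strings appear in words[0..number_of_words-1].
 * A word is counted only at its first occurrence. */
int count_different(char *words[], int number_of_words)
{
    int different = 0;

    for (int word = 0; word < number_of_words; word++) {
        int seen_before = 0;

        /* Look for the same word at any earlier position */
        for (int i = 0; i < word; i++) {
            if (strcmp(words[i], words[word]) == 0) {
                seen_before = 1;
                break;
            }
        }
        if (!seen_before)
            different++;
    }
    return different;
}
```

This does a quadratic number of comparisons, which is usually acceptable for the small text files of an exercise like this one. The comparison is case-sensitive, so `"The"` and `"the"` count as two different words unless the words are normalized first.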

smac89

The only problem with the `"answer"` you cite for reading words with `fscanf` is that you have to process each word further to avoid capturing possessives and plural possessives as separate words (e.g. `"smac89's"` answer). This poses a similar challenge for a tokenizing solution as well. – David C. Rankin Apr 24 '18 at 03:44
@DavidC.Rankin I'm assuming from the low complexity of the problem that the OP will not be dealing with such input. I assume the file he is reading consists of simple words without any extra punctuation apart from maybe the basic comma or full stop, all of which I expect the OP to be able to deal with. The problem does not seem complex; it is just the kind of grunt work most programming courses put their students through. – smac89 Apr 24 '18 at 03:55
I agree with you. Either way you go, either a per-word `fscanf`, a `fgets` and `strtok` or a simple `fgetc` with a couple of test clauses - they will all need a few extra tweaks to handle the *apostrophe-s* circumstance. – David C. Rankin Apr 24 '18 at 04:14
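
As a rough illustration of the `fgetc` option with "a couple of test clauses", here is a minimal sketch. `next_word` is a hypothetical helper; treating letters and digits as word characters, lowercasing each word, and stripping a trailing `'s` or apostrophe are assumptions made for this example, not requirements stated in the question.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Reads the next word from fp into buf (size must be at least 1;
 * longer words are truncated to size - 1 characters).
 * Returns 1 if a word was read, 0 at end of file. */
int next_word(FILE *fp, char *buf, size_t size)
{
    size_t len = 0;
    int c;

    /* Skip everything that cannot start a word */
    while ((c = fgetc(fp)) != EOF && !isalnum(c))
        ;
    if (c == EOF)
        return 0;

    /* Collect letters, digits, and embedded apostrophes */
    do {
        if (len + 1 < size)
            buf[len++] = (char)tolower(c);
        c = fgetc(fp);
    } while (c != EOF && (isalnum(c) || c == '\''));
    buf[len] = '\0';

    /* Drop trailing apostrophes, as in the plural possessive dogs' */
    while (len > 0 && buf[len - 1] == '\'')
        buf[--len] = '\0';

    /* Drop a possessive 's so that smac89's becomes smac89 */
    if (len >= 2 && strcmp(buf + len - 2, "'s") == 0)
        buf[len - 2] = '\0';

    return 1;
}
```

Calling `next_word` in a loop until it returns 0 yields one cleaned-up word per call, which can then be stored or compared with `strcmp`. Contractions such as `it's` are also shortened by this rule, so the tweak is only appropriate when that is acceptable for the input.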
