How to count words in a file?

Question

I'm creating a program that counts how many words there are in an input file. I can't figure out how to make it treat whitespace, a period, a comma, or the beginning or end of a line as a word boundary.

Contents of input file:

hello world ALL is great. HELLO WORLD ALL IS GREAT. hellO worlD alL iS great.

The output should be 15 words, but my output is 14.

I've tried adding `||` conditions for periods, commas, etc., but then it just counts those in addition to the spaces.

#include <iostream> 
#include <string>
#include <fstream>
using namespace std;

//Function Declarations
void findFrequency(int A[], string &x);
void findWords(int A[], string &x);

//Function Definitions
void findFrequency(int A[], string &x)
{   

    //Counts the number of occurences in the string
    for (int i = 0; x[i] != '\0'; i++)
    {

        if (x[i] >= 'A' && x[i] <= 'Z')
            A[toascii(x[i]) - 64]++;
        else if (x[i] >= 'a' && x[i] <= 'z')
            A[toascii(x[i]) - 96]++;
    }

    //Displaying the results
    char ch = 'a';

    for (int count = 1; count < 27; count++)
    {
        if (A[count] > 0)
        {

            cout << A[count] << " : " << ch << endl;
        }
        ch++;
    }
}


void findWords(int A[], string &x)
{

    int wordcount = 0;
    for (int count = 0; x[count] != '\0'; count++)
    {

        if (x[count] == ' ')
        {
            wordcount++;
            A[0] = wordcount;
        }
    }
    cout << A[0] << " Words " << endl;
}



int main()
{
    string x;
    int A[27] = { 0 }; //Array assigned all elements to zero
    ifstream in;    //declaring an input file stream
    in.open("mytext.dat");

    if (in.fail())
    {
        cout << "Input file did not open correctly" << endl;
    }

    getline(in,x);
    findWords(A, x);
    findFrequency(A, x);

    in.close();

    system("pause");
    return 0;
}

The output should be 15, but the result I am getting is 14.

@Renat Better yet, `std::set`. The OP doesn't need a key-value, just keys. — NathanOliver, Jul 02 '19 at 21:39
Why should the output be 15? There are 15 words, but only 14 of them are distinct by case (`great` is the same in the first and last sentence). Really the answer should be 5 since case doesn't matter for distinct words. — NathanOliver, Jul 02 '19 at 21:40
@NathanOliver, that's right, `set` definitely suits here perfectly instead of map. — Renat, Jul 02 '19 at 21:41
What's with the magic numbers of 64 and 96? Maybe you should use the character literals instead. Or better yet, see `std::toupper` and `std::tolower`. — Thomas Matthews, Jul 02 '19 at 21:44
`std::isalpha` will tell you whether the value in a `char` encodes a letter, without you needing to hard-code values that don't work for some character encodings. You might want to supplement that with `std::ispunct` to detect punctuation. — Pete Becker, Jul 02 '19 at 21:47
Any time you call getline (or any other input function) without checking its return value, your program is wrong. — , Jul 02 '19 at 21:52
You aren't counting words, you are counting spaces. If there are 2 spaces after a word, you will call it two words. If your line doesn't end with a space, you won't count the last word. — stark, Jul 02 '19 at 21:57
@NathanOliver That was poor phrasing on my part. I need it to count each word and not the spaces, even if it's, for example: Apple.Apple,orange grape — ElementMars, Jul 02 '19 at 22:06
@ThomasMatthews Professor wants us to use literals like that for some reason. — ElementMars, Jul 02 '19 at 22:07
@NeilButterworth First time using getline. How would you go about checking the return value? — ElementMars, Jul 02 '19 at 22:09

score 1 · Answer 1 · answered Jul 02 '19 at 22:09

Perhaps this is what you need?

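#include <cctype>    // std::isalpha
#include <cstddef>   // std::size_t
#include <istream>
#include <string>

std::size_t count_words(std::istream& is) {
    std::size_t co = 0;
    std::string word;
    while(is >> word) {       // read a whitespace separated chunk
        for(char ch : word) { // step through its characters
            // cast first: passing a negative char to std::isalpha is undefined
            if(std::isalpha(static_cast<unsigned char>(ch))) {
                // it contains at least one alphabetic character so
                // count it as a word and move on
                ++co;
                break;
            }
        }
    }
    return co;
}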
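
A short usage sketch for the function above, assuming you want to try it on a string before pointing it at a file. The wrapper `count_words_in_text` is a hypothetical helper, not part of the answer. `count_words` is repeated here only so that the block compiles on its own.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Same logic as count_words in the answer: a whitespace separated chunk
// counts as one word if it contains at least one letter.
std::size_t count_words(std::istream& is) {
    std::size_t co = 0;
    std::string word;
    while (is >> word) {
        for (char ch : word) {
            if (std::isalpha(static_cast<unsigned char>(ch))) {
                ++co;
                break;
            }
        }
    }
    return co;
}

// Hypothetical convenience wrapper: runs count_words over an in-memory string.
std::size_t count_words_in_text(const std::string& text) {
    std::istringstream iss(text);
    return count_words(iss);
}
```

Because the function takes a `std::istream&`, the same call works with a `std::ifstream` opened on `mytext.dat`. Note that chunks are split on whitespace only, so `Apple.Apple,orange grape` is counted as 2 words here, not 4.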

Richard Chambers · Answer 2 · 2019-07-03T15:29:09.383

Here is an approach with a few test cases as well.

The test cases are a series of char arrays with particular strings to test the findNextWord() method of the RetVal struct/class.

char line1[] = "this is1    a  line. \t of text  \n ";  // multiple white spaces
char line2[] = "another   line";    // string that ends with zero terminator, no newline
char line3[] = "\n";                // line with newline only
char line4[] = "";                  // empty string with no text

And here is the actual source code.

#include <iostream>
#include <cstring>   // strncpy
#include <cctype>    // isspace, ispunct

struct RetVal {
    RetVal(char *p1, char *p2) : pFirst(p1), pLast(p2) {}
    RetVal(char *p2 = nullptr) : pFirst(nullptr), pLast(p2) {}
    char *pFirst;
    char *pLast;

    bool  findNextWord()
    {
        if (pLast && *pLast) {
            pFirst = pLast;
            // scan the input line looking for the first non-space character.
            // the isspace() function indicates true for any of the following
            // characters: space, newline, tab, carriage return, etc.
            while (*pFirst && isspace(static_cast<unsigned char>(*pFirst))) pFirst++;

            if (pFirst && *pFirst) {
                // we have found a non-space character so now we look
                // for a space character or the end of string.
                pLast = pFirst;
                while (*pLast && ! isspace(static_cast<unsigned char>(*pLast))) pLast++;
            }
            else {
                // indicate we are done with this string.
                pFirst = pLast = nullptr;
            }
        }
        else {
            pFirst = nullptr;
        }

        // return value indicates if we are still processing, true, or if we are done, false.
        return pFirst != nullptr;
    }
};

void printWords(RetVal &x)
{
    int    iCount = 0;

    while (x.findNextWord()) {
        char xWord[128] = { 0 };

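        // copy at most sizeof(xWord) - 1 characters so the buffer stays zero terminated
        std::size_t len = static_cast<std::size_t>(x.pLast - x.pFirst);
        if (len >= sizeof(xWord)) len = sizeof(xWord) - 1;
        strncpy(xWord, x.pFirst, len);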
        iCount++;
        std::cout << "word " << iCount << " is \"" << xWord << "\"" << std::endl;
    }

    std::cout << "total word count is " << iCount << std::endl;
}

int main()
{
    char line1[] = "this is1    a  line. \t of text  \n ";
    char line2[] = "another   line";
    char line3[] = "\n";
    char line4[] = "";

    std::cout << "Process line1[] \"" << line1 << "\""  << std::endl;
    RetVal x (line1);
    printWords(x);

    std::cout << std::endl << "Process line2[] \"" << line2 << "\"" << std::endl;
    RetVal x2 (line2);
    printWords(x2);

    std::cout << std::endl << "Process line3[] \"" << line3 << "\"" << std::endl;
    RetVal x3 (line3);
    printWords(x3);

    std::cout << std::endl << "Process line4[] \"" << line4 << "\"" << std::endl;
    RetVal x4(line4);
    printWords(x4);

    return 0;
}

And here is the output from this program. Some of the input lines contain a newline character, which shows up as a line break inside the quoted string when it is printed to the console.

Process line1[] "this is1    a  line.    of text
 "
word 1 is "this"
word 2 is "is1"
word 3 is "a"
word 4 is "line."
word 5 is "of"
word 6 is "text"
total word count is 6

Process line2[] "another   line"
word 1 is "another"
word 2 is "line"
total word count is 2

Process line3[] "
"
total word count is 0

Process line4[] ""
total word count is 0

If you need to treat punctuation like white space, as something to be skipped, you can modify the findNextWord() method to add an ispunct() test to both loops, as in:

bool  findNextWord()
{
    if (pLast && *pLast) {
        pFirst = pLast;
        // scan the input line looking for the first character that is
        // neither white space (space, newline, tab, carriage return, etc.)
        // nor punctuation.
        while (*pFirst && (isspace(static_cast<unsigned char>(*pFirst)) || ispunct(static_cast<unsigned char>(*pFirst)))) pFirst++;

        if (pFirst && *pFirst) {
            // we have found the start of a word so now we look for
            // white space, punctuation, or the end of string.
            pLast = pFirst;
            while (*pLast && ! (isspace(static_cast<unsigned char>(*pLast)) || ispunct(static_cast<unsigned char>(*pLast)))) pLast++;
        }
        else {
            // indicate we are done with this string.
            pFirst = pLast = nullptr;
        }
    }
    else {
        pFirst = nullptr;
    }

    // return value indicates if we are still processing, true, or if we are done, false.
    return pFirst != nullptr;
}

In general, if you need to refine the filters for the beginning and end of a word, you can replace the tests in those two loops with some other function that looks at a character and classifies it as either valid in a word or not.
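
As a rough sketch of that idea, the classifier can be passed in as a parameter so the scanning code never changes. This version uses `std::string` and `std::function` rather than the pointer pair in `RetVal`, and every name in it (`CharFilter`, `splitWords`, `lettersOnly`, `lettersDigitsApostrophe`) is an assumption made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical classifier type: returns true if the character may be part of a word.
using CharFilter = std::function<bool(unsigned char)>;

// Splits text into words; a word is a maximal run of characters accepted by isWordChar.
std::vector<std::string> splitWords(const std::string& text, const CharFilter& isWordChar) {
    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < text.size()) {
        // skip everything the filter rejects (separators)
        while (i < text.size() && !isWordChar(static_cast<unsigned char>(text[i]))) ++i;
        const std::size_t start = i;
        // consume everything the filter accepts (the word itself)
        while (i < text.size() && isWordChar(static_cast<unsigned char>(text[i]))) ++i;
        if (i > start) words.push_back(text.substr(start, i - start));
    }
    return words;
}

// Two example filters: letters only, or letters, digits and the apostrophe.
bool lettersOnly(unsigned char ch) { return std::isalpha(ch) != 0; }
bool lettersDigitsApostrophe(unsigned char ch) { return std::isalnum(ch) != 0 || ch == '\''; }
```

With `lettersOnly`, the string `Apple.Apple,orange grape` splits into 4 words. With `lettersDigitsApostrophe`, `don't` stays a single word and `is1` keeps its digit.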
