Count the number of times each word occurs in a file

Question

Hi I am writing a program that counts the number of times each word occurs in a file. Then it prints a list of words with counts between 800 and 1000, sorted in the order of count. I am stuck on keeping a counter to see if the first word matches the next until a new word appears. In the main I am trying to open the file, read each word by word and call sort in the while loop to sort the vector. Then, in the for loop go through all the words and if the first word equals the second count++. I don't think that is how you keep a counter.

Here is the code:

#include <string>
#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
#include <set>

using namespace std;

vector<string> lines;
vector<int> second;
set<string> words;
multiset<string> multiwords;

void readLines(const char *filename)
{
    string line;
    ifstream infile;
    infile.open(filename);
    if (!infile)
    {       
        cerr << filename << " cannot open" << endl; 
          return; 
    }       
    getline(infile, line);
    while (!infile.eof())
    {
        lines.push_back(line);
        getline(infile, line);
    }  
    infile.close();
}

int binary_search(vector<string> &v, int size, int value)
{
    int from = 0;
    int to = size - 1;
    while (from <= to)
    {  
        int mid = (from + to) / 2;
        int mid_count = multiwords.count(v[mid]);
        if (value == mid_count) 
            return mid;
        if (value < mid_count) to = mid - 1;
        else from = mid + 1;
    }
   return from;
}

int main() 
{
    vector<string> words;
    string x;
    ifstream inFile;
    int count = 0;

    inFile.open("bible.txt");
    if (!inFile) 
    {
        cout << "Unable to open file";
        exit(1);
    }
    while (inFile >> x){
        sort(words.begin(), words.end());
    }

    for(int i = 0;i < second.size();i++)
    {
        if(x == x+1)
        {
            count++;
        }
        else
            return;
    }
    inFile.close();
}

I'm sorry I don't know why it did that at the beginning of my code. — Chris, May 23 '11 at 22:54
Did you actually ask a question? Keeping a counter? Also, I'd be quite tempted to use sort|uniq -c (possibly with a sed to split multiword lines) if this was for something other than homework. I might use that design pattern when writing a solution, though a hash table would be more efficient if more complicated. — Seth Robertson, May 23 '11 at 22:57
You should think about **why** you are `return`ing in the `for` loop in `main()`. — Matt Ball, May 23 '11 at 23:01
Would it work if I read the first word into a new struct and set occurrences to 1, push into vector. read the second word, if it is in the vector update that words occurrences else, new struct, push. repeat. When done sort by the field, occurrences. — Chris, May 23 '11 at 23:09
This is my firs time on this site how do I post code in a comment? All I see is `code` — Chris, May 23 '11 at 23:15
@Chris, You might miss the last line of text in the input field. You should use `while (!infile.fail())` instead `while (!infile.eof())`. The last successful call to getline might set the eof flag. It's better to you fail() because it will only return true after the first unsuccessful call to getline. — Aaron McDaid, May 23 '11 at 23:23

sehe · Answer 1 · 2011-05-24T16:08:39.977

He. I know bluntly showing a solution is not really helping you. However.

I glanced through your code and saw many unused and confused bits. Here's what I'd do:

#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using namespace std;

// types
typedef std::pair<string, size_t> frequency_t;
typedef std::vector<frequency_t> words_t;

// predicates
static bool byDescendingFrequency(const frequency_t& a, const frequency_t& b)
{ return a.second > b.second; }

const struct isGTE // greater than or equal
{ 
    size_t inclusive_threshold;
    bool operator()(const frequency_t& record) const 
        { return record.second >= inclusive_threshold; }
} over1000 = { 1001 }, over800  = { 800 };

int main() 
{
    words_t words;
    {
        map<string, size_t> tally;

        ifstream inFile("bible.txt");
        string s;
        while (inFile >> s)
            tally[s]++;

        remove_copy_if(tally.begin(), tally.end(), 
                       back_inserter(words), over1000);
    }

    words_t::iterator begin = words.begin(),
                      end = partition(begin, words.end(), over800);
    std::sort(begin, end, &byDescendingFrequency);

    for (words_t::const_iterator it=begin; it!=end; it++)
        cout << it->second << "\t" << it->first << endl;
}

Authorized Verion:

993 because
981 men
967 day
954 over
953 God,
910 she
895 among
894 these
886 did
873 put
868 thine
864 hand
853 great
847 sons
846 brought
845 down
819 you,
811 so

Vulgata:

995 tuum
993 filius
993 nec
966 suum
949 meum
930 sum
919 suis
907 contra
902 dicens
879 tui
872 quid
865 Domine
863 Hierusalem
859 suam
839 suo
835 ipse
825 omnis
811 erant
802 se

Performance is about 1.12s for for both files, but only 0.355s after drop-in replacing map<> with boost::unordered_map<>

@Chris: anytime. Btw, it's customary to use the voting buttons instead of saying thanks in the comments :) (e.g [related discussion](http://meta.stackexchange.com/questions/21258/asking-upvote-for-accepted-answer)) — sehe, May 24 '11 at 00:20

score 2 · Answer 2 · answered May 23 '11 at 23:10

2

A more efficient approach can be done with a single map< string, int > of occurrences, read words one by one, and increment the counter in m[ word ]. After all words have been accounted for, iterate over the map, for words in the given range, add them to a multimap<int, string>. Finally dump the contents of the multimap, that will be ordered by number of occurrences and alphabetical order...

answered May 23 '11 at 23:10

David Rodríguez - dribeas

204,818
23
294
489

How can I do this I am not familiar with map? – Chris May 23 '11 at 23:18
It's as simple as: `#include ; map m; while (/*get word as string*/) { m[word]++; }` It will be very easy to find examples of the `map` online, I suggest you look it up. – Aaron McDaid May 23 '11 at 23:26

Nawaz · Accepted Answer · 2011-05-23T23:27:05.690

2

One solution could be this : define letter_only locale so as to ignore punctuations coming from the stream, and to read only valid "english" letters from the input stream. That way, the stream will treat the words "ways", "ways." and "ways!" as just the same word "ways", because the stream will ignore punctuations like "." and "!".

struct letter_only: std::ctype<char> 
{
    letter_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector<std::ctype_base::mask> 
            rc(std::ctype<char>::table_size,std::ctype_base::space);

        std::fill(&rc['A'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};

And then use it as:

int main()
{
     std::map<std::string, int> wordCount;
     ifstream input;

     //enable reading only english letters only!
     input.imbue(std::locale(std::locale(), new letter_only())); 

     input.open("filename.txt");
     std::string word;
     std::string uppercase_word;
     while(input >> word)
     {
         std::transform(word.begin(), 
                        word.end(), 
                        std::back_inserter(uppercase_word),
                        (int(&)(int))std::toupper); //the cast is needed!
         ++wordCount[uppercase_word];
     }
     for (std::map<std::string, int>::iterator it = wordCount.begin(); 
                                               it != wordCount.end(); 
                                               ++it)
     {
           std::cout << "word = "<< it->first 
                     <<" : count = "<< it->second << std::endl;
     }
}

edited May 23 '11 at 23:27

answered May 23 '11 at 23:21

Nawaz

353,942
115
666
851

@Nawaz, what is this printing out? – Chris May 23 '11 at 23:38
@Chris: this prints the words and their counts. `'it'` is an iterator which is used to iterate through the map's element which is a pair of two items: `string`, `int` (i.e word, count). `it->first` is the word, and `it->second` is the count. Why don't you run this code and see the output? – Nawaz May 23 '11 at 23:47
@Nawaz, I'm getting an error for the line back_inserter(uppercase_word)... I took out all of the std:: and just put using namespace std; at the top – Chris May 23 '11 at 23:49
Also, how would I make it so it is just printing out counts between 800 and 1000? – Chris May 23 '11 at 23:49
@Chris: What is the error? Also, what happened when you just wrote `using namespace std` at the top? – Nawaz May 23 '11 at 23:50
@Chris: Just put an `if` condition before the `cout` as `if(it->second >= 800 && it->second <= 1000)` – Nawaz May 23 '11 at 23:51
@Nawaz, doesn't using namespace std make it so you don't have to put all of the std:: in front of everything? And that line is saying back_inserter is unidentified – Chris May 23 '11 at 23:54
@Chris: You've to `#include `, because `std::back_inserter` is defined in it. – Nawaz May 23 '11 at 23:56
@Nawaz, I am getting a no newline at end of file warning and it won't run. word.cpp:53:2: warning: no newline at end of file. – Chris May 24 '11 at 00:03
@Chris: Obviously, there is some syntax problem in your code. You can fix such problems yourself, by formatting the code, and looking for the syntax error. – Nawaz May 24 '11 at 00:05
@Chris: Did you include all these headers, ``, ``, ``, ``, ``? – Nawaz May 24 '11 at 00:09
@Chris: If worked, then you can accept the solution by clicking on the `tick` mark on the answer. – Nawaz May 24 '11 at 00:11
1

@Nawaz: You totally deserved my upvote for actually imbuing a ctype in the wild. It's the first time I've seen that outside of the Josuttis book! – sehe May 24 '11 at 00:26
@sehe: Oh really. thank you. for more usage you can see my blog : http://thecodingpoint.blogspot.com/2011/04/elegant-ways-to-tokenize-strings.html – Nawaz May 24 '11 at 01:06
@Nawaz: You may want to review that post (the one in the link), it crashes quite badly. Also regarding this question I don't think it is a good idea to give a complete code solution to a question tagged as homework, and moreover using features that as others commented before are hardly ever used. – David Rodríguez - dribeas May 24 '11 at 07:21
@David: the link to my blog? Its working for me. http://ideone.com/aKL0m ... Or you're referring me to some other post? :-/ – Nawaz May 24 '11 at 07:21
@David: Yeah. I agree with the homework comment. I actually did not see the tag when I posted the solution. Also because it didn't take any effort to post it here, as I had the solution already in my blog. But I will keep this in mind from next time. :-) – Nawaz May 24 '11 at 07:24
If you try running the code in your blog in MacOSX it will crash. The fact that it works in a particular environment does not mean that it is sound and will work in all other environments. Basically the call to `classic_table()` will return a pointer to the table in the "C" locale, and that locale might or not be present in your current implementation. It seems that it is not present in MacOSX. The code fails both with g++4.2 and g++4.6. – David Rodríguez - dribeas May 24 '11 at 08:04
@David: That is a valuable comment. Thanks for it. I will look into the code once I reach home (I'm in office right now, can't spend that much time on blog). – Nawaz May 24 '11 at 08:21
@Nawaz: Your facet is also mildly flawed. For example, on a typical implementation it will treat `[` and `]` as letters because in ASCII, ISO 8859, Unicode, etc., they're located between upper and lower-case letters. – Jerry Coffin May 24 '11 at 19:03

score 0 · Answer 4 · edited May 23 '17 at 12:06

Just for fun, I did a solution in c++0x style, using Boost MultiIndex.

This style would be quite clumsy without the auto keyword (type inference).

By maintaining the indexes by word and by frequency at all times, there is no need to remove, partition, nor sort the wordlist: it'll all be there.

To compile and run:

g++ --std=c++0x -O3 test.cpp -o test
curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz |
    tar xzO | sed 's/^[ 0-9:]\+//' > bible.txt
time ./test

.

#include <boost/foreach.hpp>
#include <boost/lambda/lambda.hpp>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

struct entry 
{
    string word;
    size_t freq;
    void increment() { freq++; }
};

struct byword {}; // TAG
struct byfreq {}; // TAG

int main() 
{
    using ::boost::lambda::_1;
    using namespace ::boost::multi_index;
    multi_index_container<entry, indexed_by< // sequenced<>,
            ordered_unique    <tag<byword>, member<entry,string,&entry::word> >, // alphabetically
            ordered_non_unique<tag<byfreq>, member<entry,size_t,&entry::freq> > // by frequency
                > > tally;

    ifstream inFile("bible.txt");
    string s;
    while (inFile>>s)
    {
        auto& lookup = tally.get<byword>();
        auto it = lookup.find(s);

        if (lookup.end() != it)
            lookup.modify(it, boost::bind(&entry::increment, _1));
        else
            lookup.insert({s, 1});
    }

    BOOST_FOREACH(auto e, tally.get<byfreq>().range(800 <= _1, _1 <= 1000))
        cout << e.freq << "\t" << e.word << endl;

}

Note how

it became just slightly more convenient to define a custom entry type instead of using std::pair
(for obvious reasons), this is slower than my earlier code: this maintains the index by frequency during the insertion phase. This is unnecessary, but it makes for much more efficient extraction of the [800,1000] range:

tally.get<byfreq>().range(800 <= _1, _1 <= 1000)

The multi-set of frequencies is already ordered. So, the actual speed/memory trade of might tip in the favour of this version, especially when documents would be large and contain very few duplicated words (alas, this is a property known not to hold for the corpus text of the bible, lest someone translate it to neologorrhea).

score 0 · Answer 5 · answered Dec 05 '22 at 12:29

With more modern features available in C++ 20, we can now give an improved answer. By using newly available STL container, we can achive the whole task which just a few statements.

There is nearly a universal solution approach for "counting". We can use the std::unordered_map. It is described in the C++ reference here.

It is the std::unordered_maps very convienient index operator [] which makes counting very simple. This operator returns a reference to the value that is mapped to a key. So, it searched for the key and then returns the value. If the key does not exist, it inserts a new key/value pair and returns a reference to the value.

So, in any way, a reference to the value is returned. And this can be incremented. Example:

With a "std::unordered_map<char, int> mymap{};" and a text "aba", the follwoing can be done with the index operator:

mymap['a'] will search for an 'a' in the map. It is not found, so a new entry for 'a' with corresponding value=0 is created: The a reference to the value is returned. And, if we now increment that, we will increment the counter value. So, mymap['a']++, wiil insert a new gey/value pair 'a'/0, then increment it, resulting in 'a'/1
For 'b' the same as above will happen.
For the next 'a', an entry will be found in the map, an so a reference to the value (1) is returned. This is incremented and will then be 2

And so on and so on.

Of course we will use a std::string as key in our example.

Sorting is similar simple. We just insert everything in a std::multimap, with key and value exchanged.Then everything is sorted according to the frequency.

With the very convienient lower_bound and upper_bound function of the std::multimap we can find and display the requested values easily.

All this will result in a compact code example:

#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
#include <map>
#include <iterator>
#include <ranges>
#include <iomanip>

namespace rng = std::ranges;                // Abbreviation for the ranges namespace

int main() {

    // Open bible text file and check, if it could be opened
    if (std::ifstream ifs{ "r:\\bible.txt" }; ifs) {
        
        // Read all words from file and count them
        std::unordered_map<std::string, std::size_t> counter{};
        for (const auto& word : rng::istream_view<std::string>(ifs)) counter[word]++;

        // Sort the words according to their frequency
        std::multimap<std::size_t, std::string> sorter{};
        for (const auto& [word, count] : counter) sorter.insert({ count, word });

        // Show words with frequency given in a certain range
        for (const auto& [count, word] : rng::subrange{ sorter.lower_bound(800), sorter.upper_bound(1000) })
            std::cout << std::setw(25) << word << " --> " << count << '\n';
    }
    else std::cerr << "\n*** Error: Could not open source text file\n";
}

Count the number of times each word occurs in a file

5 Answers5

Linked

Related