12

What are the elegant and effective ways to count the frequency of each "english" word in a file?

pintu
  • 705
  • 2
  • 6
  • 8

8 Answers

16

First of all, I define a letter_only std::locale so as to ignore punctuation coming from the stream, and to read only valid "english" letters from the input stream. That way, the stream will treat the words "ways", "ways." and "ways!" as the same word "ways", because the stream will ignore punctuation such as "." and "!".

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <map>
#include <string>
#include <vector>

struct letter_only: std::ctype<char> 
{
    letter_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector<std::ctype_base::mask> 
            rc(std::ctype<char>::table_size,std::ctype_base::space);

        // Two separate fills: a single fill from 'A' to 'z' would also
        // classify the characters between 'Z' and 'a' ([ \ ] ^ _ `) as letters.
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};

Solution 1

int main()
{
     std::map<std::string, int> wordCount;
     std::ifstream input;
     input.imbue(std::locale(std::locale(), new letter_only())); //enable reading only letters! imbue before open!
     input.open("filename.txt");
     std::string word;
     while (input >> word)
     {
         ++wordCount[word];
     }
     for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
     {
           std::cout << it->first << " : " << it->second << std::endl;
     }
}

Solution 2

struct Counter
{
    std::map<std::string, int> wordCount;
    void operator()(const std::string & item) { ++wordCount[item]; }
    operator std::map<std::string, int>() { return wordCount; }
};

int main()
{
     std::ifstream input;
     input.imbue(std::locale(std::locale(), new letter_only())); //enable reading only letters! imbue before open!
     input.open("filename.txt");
     std::istream_iterator<std::string> start(input);
     std::istream_iterator<std::string> end;
     std::map<std::string, int> wordCount = std::for_each(start, end, Counter());
     for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
     {
          std::cout << it->first << " : " << it->second << std::endl;
     }
}
Nawaz
  • 353,942
  • 115
  • 666
  • 851
  • However, the answer made it also clear that "non-whitespace character sequences delimited by whitespace" is not the definition of "word" the OP is after. – sbi Feb 03 '11 at 16:51
  • 1
    I think this is the right answer since he want the frequency of repeated words. – Murilo Vasconcelos Feb 03 '11 at 16:52
  • 1
    The input loop in the first solution is wrong. The eof flag is set _after_ an input operation that fails due to reaching eof. – James McNellis Feb 03 '11 at 16:56
  • Again, this is __not__ the right answer. The OP is __not__ asking for whitespace-delimited words. This would count `"end_of_sentence."` and `"end_of_sentence!"` as __two distinct words__, which is __not what the OP wants__. – sbi Feb 03 '11 at 16:58
  • @Nawaz for fully correct answer you probably need to clear tokens from punctuations. – UmmaGumma Feb 03 '11 at 16:58
  • 1
    @Nawaz: Why not just use the idiomatic `while (input >> word)`? It's still wrong as written since the other flags aren't checked. – James McNellis Feb 03 '11 at 17:23
  • That first loop is ugly. Though you have correct usage, doing it like that is error prone (for beginners); we should be encouraging newbies to put the read into the while condition: `while (input >> word)` – Martin York Feb 03 '11 at 17:25
  • While you are cleaning up, why not wrap it all in main() so that a beginner can just cut and paste the code and get a working program. – Martin York Feb 03 '11 at 17:27
  • @James: I did that. Also worked on locale also. Please check it out. and let me know if you find anything wrong with the solution! – Nawaz Feb 03 '11 at 17:34
  • @sbi, and @Ashot : I fixed that by defining `letter_only` locale. Please check it out! – Nawaz Feb 03 '11 at 17:35
  • @Martin: Please check out the complete solution now. :-) – Nawaz Feb 03 '11 at 17:37
  • @Nawaz: imbue() on a file that is already opened silently fails. You need to imbue the file object before you actually open the file (i.e. before providing the filename). – Martin York Feb 03 '11 at 17:45
  • @Martin: It doesn't fail. I checked it. `locale` is used when reading file, so it makes sense! – Nawaz Feb 03 '11 at 17:47
  • @Nawaz: Let me re-phrase. It will not work on all implementations. You need to declare the object, imbue(), then open(filename). This is done because some locales use characters from the stream to define how they work (like the BOM marker in UTF-8/16 files, which can be extracted by the locale object to define how it works. As a result the usual action of imbue is to fail if the file is already open). – Martin York Feb 03 '11 at 17:48
  • @Martin: That is interesting. Can you explain why? Any reference for that? I didn't find anything of that sort! – Nawaz Feb 03 '11 at 17:51
  • @Nawaz: Just stepping through the code. In g++, look at bits/fstream.tcc basic_filebuf::imbue() (the rules are a bit more complex, but it always works if the file has NOT been opened. It may fail if the file has been opened; it may depend on what the old locale did to the file when it was opened). – Martin York Feb 03 '11 at 18:03
  • @Martin: thanks a lot for explaining that. I didn't know that. So I made those changes, as you suggested. Let me know if there is any potential problem. :-) – Nawaz Feb 03 '11 at 18:05
  • @Nawaz I found a bug in your code. You need to call tolower on words before adding them to the map (see my solution). – UmmaGumma Feb 03 '11 at 18:24
4

Perl is arguably not so elegant, but very effective.
I posted a solution here: Processing huge text files

In a nutshell,

1) If needed, strip punctuation and convert uppercase to lowercase:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file

2) Count the occurrence of each word. Print results sorted first by frequency, and then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq

I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in under 3 minutes.

Community
  • 1
  • 1
Chris Koknat
  • 3,305
  • 2
  • 29
  • 30
2

Here is a working solution. It should work with real text (including punctuation):

#include <iterator>
#include <iostream>
#include <fstream>
#include <map>
#include <string>
#include <cctype>

std::string getNextToken(std::istream &in)
{
    // Read into an int so EOF is distinguishable from valid characters,
    // and so isalpha/tolower are not passed a negative char (undefined behavior).
    int c = in.get();
    std::string ans = "";
    while (c != EOF && !std::isalpha(c)) // skip non-letter characters
    {
        c = in.get();
    }
    while (c != EOF && std::isalpha(c))
    {
        ans.push_back(static_cast<char>(std::tolower(c)));
        c = in.get();
    }
    return ans;
}

int main()
{
    std::map<std::string,int> words;
    std::ifstream fin("input.txt");

    std::string s;
    std::string empty ="";
    while((s=getNextToken(fin))!=empty )
            ++words[s];

    for(std::map<std::string,int>::iterator iter = words.begin(); iter!=words.end(); ++iter)
        std::cout<<iter->first<<' '<<iter->second<<std::endl;
}

Edit: my code now calls tolower for every letter.

UmmaGumma
  • 5,633
  • 1
  • 31
  • 45
  • This undoubtedly works for English (which is what the OP asked, I know), but not for other languages. It also won't work if there is a number in the input text. – Baltasarq Feb 03 '11 at 17:36
  • @Baltasarq the question asks for "english" words. Also, isalpha doesn't return true for digits. – UmmaGumma Feb 03 '11 at 17:47
2

My solution is the following. First, all punctuation symbols are converted to spaces. Then, basically the same approach as in earlier answers is used to extract the words:

#include <fstream>
#include <map>
#include <sstream>
#include <string>

const std::string Symbols = ",;.:-()\t!¡¿?\"[]{}&<>+-*/=#'";
typedef std::map<std::string, unsigned int> WCCollection;

void countWords(const std::string &fileName, WCCollection &wcc)
{
    std::ifstream input( fileName.c_str() );

    if ( input.is_open() ) {
        std::string line;
        std::string word;

        while( std::getline( input, line ) ) {
            // Substitute punctuation symbols with spaces
            // (a mutable iterator is needed here: a const_iterator
            // would not allow the assignment *it = ' ')
            for(std::string::iterator it = line.begin(); it != line.end(); ++it) {
                if ( Symbols.find( *it ) != std::string::npos ) {
                    *it = ' ';
                }
            }

            // Let std::operator>> separate by spaces
            std::istringstream filter( line );
            while( filter >> word ) {
                ++( wcc[word] );
            }
        }
    }
}
Baltasarq
  • 12,014
  • 3
  • 38
  • 57
1
  1. Decide on exactly what you mean by "an English word". The definition should cover things like whether "able-bodied" is one word or two, how to handle apostrophes ("Don't trust 'em!"), whether capitalization is significant, etc.

  2. Create a set of test cases so you can be sure you get all the decisions in step 1 correct.

  3. Create a tokenizer that reads the next word (as defined in step 1) from the input and returns it in a standard form. Depending on your definition, this might be a simple state machine, a regular expression, or just relying on <istream>'s extraction operators (e.g., std::cin >> word;). Test your tokenizer with all the test cases from step 2.

  4. Choose a data structure for keeping the words and counts. In modern C++, you'd probably end up with something like std::map<std::string, unsigned> or std::unordered_map<std::string, int>.

  5. Write a loop that gets the next word from the tokenizer and increments its count in the histogram until there are no more words in the input.

Adrian McCarthy
  • 45,555
  • 16
  • 123
  • 175
1

Pseudocode for an algorithm which I believe to be close to what you want:

counts = defaultdict(int)
for line in file:
  for word in line.split():
    if any(x.isalpha() for x in word):
      counts[word.upper()] += 1

freq = sorted(((count, word) for word, count in counts.items()), reverse=True)
for count, word in freq:
  print "%d\t%s" % (count, word)

Case-insensitive comparison is handled naïvely and probably combines words you don't want to combine in an absolutely general sense. Be careful of non-ASCII characters in your implementation of the above. False positives may include "1-800-555-TELL", "0xDEADBEEF", and "42 km", depending on what you want. Missed words include "911 emergency services" (I'd probably want that counted as three words).

In short, natural language parsing is hard: you probably can make do with some approximation depending on your actual use case.

Fred Nurk
  • 13,952
  • 4
  • 37
  • 63
  • 2
    A funny way to answer a C++ question: Providing Python code and then declaring it to be pseudocode. Considering, this uses types from the Python stdlib without importing it, and comprehensions, and that any C++ folks reading this have to guess a lot, I'm surprised this got an upvote. Maybe this is a secret experiment to see how many C++ programmers can be silently & unknowingly converted to Python enthusiasts? – cfi Jan 10 '14 at 13:14
0
string mostCommon( string filename ) {

    ifstream input( filename );
    string mostFreqUsedWord;
    string token;
    map< string, int > wordFreq;

    if ( input.is_open() ) {
        while ( input >> token ) {
            wordFreq[ token ]++;
            if ( wordFreq[ token ] > wordFreq[ mostFreqUsedWord ] )
                mostFreqUsedWord = token;
        }
        input.close();
    } else {
        cout << "Unable to open file." << endl;
    }
    return mostFreqUsedWord;
}
0

One more simple way is to count the number of spaces in the file, treating any run of more than one consecutive space as a single separator, if you assume only a single space between words...

Chirag Tayal
  • 459
  • 1
  • 6
  • 14