1

I'm reading in several documents and indexing the words I read in. However, I want to ignore common words (a, an, the, and, is, or, are, etc.).

Is there a shorter way to do this than writing something like...

if(word=="and" || word=="is" || etc etc....) ignore word;

For example, can I put them into a const string somehow, and have it just check against the string? Not sure... thank you!

jli
Heather Wilson
  • Search for 'stop' words: http://databases.aspfaq.com/database/how-do-i-ignore-common-words-in-a-search.html – Mitch Wheat Apr 15 '12 at 00:44

3 Answers

5

Create a set<string> with the words that you would like to exclude, and use mySet.count(word) to determine if the word is in the set. If it is, the count will be 1; it will be 0 otherwise.

#include <iostream>
#include <set>
#include <string>
using namespace std;

int main() {
    const char *words[] = {"a", "an", "the"};
    set<string> wordSet(words, words+3);
    cerr << wordSet.count("the") << endl;
    cerr << wordSet.count("quick") << endl;
    return 0;
}
Sergey Kalinichenko
  • And with C++11, you can even write `set<string> words {"and", "is", ...}`. – Philipp Apr 15 '12 at 00:48
  • This is basically the right answer, but consider the arguments in [Why you shouldn't use set (and what you should use instead)](http://lafstern.org/matt/col1.pdf). – bames53 Apr 15 '12 at 00:50
  • @bames53 That's an interesting argument, but it does not argue that the set is bad, just that there are things that are more economical. I think in this case the use of set is OK: the improvement from replacing it with a sorted vector would be marginal, but explaining the change would take a lot of keystrokes. – Sergey Kalinichenko Apr 15 '12 at 00:59
  • Some empirical testing for this argument: http://www.umich.edu/~eecs381/handouts/FillErUp.pdf – keelerjr12 Apr 15 '12 at 01:13
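
For comparison, the sorted-vector alternative raised in the comments might look roughly like the sketch below. It assumes C++11; the helper names `stopWords` and `isStopWord` and the word list are made up for illustration. Whether it actually beats `set` for a list this small would need measuring.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: builds the sorted stop-word list once.
// The words themselves are only an example.
const std::vector<std::string>& stopWords() {
    static const std::vector<std::string> words = [] {
        std::vector<std::string> w = {"the", "a", "an", "and", "is", "or", "are"};
        std::sort(w.begin(), w.end());  // binary_search requires sorted input
        return w;
    }();
    return words;
}

// Hypothetical helper: true if word is in the sorted list (O(log n) comparisons).
bool isStopWord(const std::string& word) {
    const std::vector<std::string>& w = stopWords();
    return std::binary_search(w.begin(), w.end(), word);
}
```

Calling `isStopWord(word)` then plays the same role as `wordSet.count(word)` in the answer.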
1

You can use an array of strings, looping through it and comparing against each entry, or use a more efficient data structure such as a set or a trie.

Here's an example of how to do it with a normal array:

const char *commonWords[] = {"and", "is" ...};
int commonWordsLength = 2; // number of words in the array

for (int i = 0; i < commonWordsLength; ++i)
{
    // strcmp (from <cstring>) returns 0 when the two C strings are equal
    if (!strcmp(word, commonWords[i]))
    {
        //ignore word;
        break;
    }
}

Note that this example doesn't use the C++ STL, but you should.
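
As a rough sketch of what a standard-library version of the same loop could look like, assuming `word` is a `std::string` and C++11 is available (the helper name `isCommonWord` is made up for illustration):

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Example list only; extend it with whatever words you want to skip.
const std::string commonWords[] = {"and", "is"};

// Hypothetical helper: linear scan, same idea as the loop above,
// but std::string's operator== replaces strcmp.
bool isCommonWord(const std::string& word) {
    return std::find(std::begin(commonWords), std::end(commonWords), word)
           != std::end(commonWords);
}
```

This is still a linear search, so it only makes sense for short lists; `std::begin` and `std::end` also remove the need to keep a separate length variable in sync with the array.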

jli
0

If you want to maximize performance you should create a trie....

http://en.wikipedia.org/wiki/Trie

...of stopwords....

http://en.wikipedia.org/wiki/Stop_words

There is no standard C++ trie data structure; however, see this question for third-party implementations:

Trie implementation
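
To give an idea of what such a structure involves, here is a minimal hand-rolled sketch, assuming C++11. The class name `StopWordTrie` and its members are made up for illustration and do not come from any library; a tuned implementation would likely use arrays rather than `std::map` for the child links.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical minimal trie; names are illustrative, not from any library.
class StopWordTrie {
    struct Node {
        bool isWord = false;                             // a stored word ends here
        std::map<char, std::unique_ptr<Node>> children;  // one edge per character
    };
    Node root_;

public:
    // Adds a word, creating nodes along its path as needed.
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->isWord = true;
    }

    // True only if the exact word was inserted (prefixes do not count).
    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return node->isWord;
    }
};
```

You would call `insert` once for each stopword at startup and then `contains(word)` for each word you read.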

If you can't be bothered with that and want to use a standard container, the best one to use is unordered_set<string> (C++11), which stores the stopwords in a hash table.

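#include <string>
#include <unordered_set>
using namespace std;

// Returns true if the word should be kept (i.e. it is not a stopword).
bool filter(const string& word)
{
    static const unordered_set<string> stopwords({"a", "an", "the"});
    return !stopwords.count(word);
}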
Andrew Tomazos