
I am attempting to find tags (keywords) for a recipe by parsing a long string of text. The text contains the recipe's ingredients, directions, and a short blurb.

What do you think would be the most efficient way to remove common words from the tag list?

By common words, I mean words like: 'the', 'at', 'there', 'their' etc.

I have two methodologies I can use. Which do you think is more efficient in terms of speed, and do you know of a more efficient way I could do this?

Methodology 1:
- Determine the number of times each word occurs (using collections.Counter).
- Have a list of common words and remove each of them from the Counter by deleting that key if it exists.
- The speed will therefore be determined by the length of the delims list.

from collections import Counter

delims    = ['there', "there's", 'theres', 'they', "they're"]
# the above will end up being a really long list!
word_freq = Counter(recipe_str.lower().split())
for delim in set(delims):
    del word_freq[delim]   # Counter silently ignores deleting a missing key
return word_freq.most_common()

Methodology 2:
- For common words that can be plural, look at each word in the recipe string and check whether it contains the non-plural form of a common word as a substring. For example, for the string "There's a test", check each word to see if it contains "there" and delete it if it does.

delims         = ['this', 'at', 'them']   # words that can't be plural
partial_delims = ['there', 'they']        # words that could occur in many forms
word_freq      = Counter(recipe_str.lower().split())
for delim in set(delims):
    del word_freq[delim]
# really slow
for delim in set(partial_delims):
    for word in list(word_freq):          # copy the keys so we can delete while iterating
        if word.find(delim) != -1:
            del word_freq[word]
return word_freq.most_common()
sazr
    I can't give you a full answer here, but I did want to mention something that could help you. When doing any kind of textual analysis, you generally want to be able to treat pluralizations, conjugations, and other such transformations as all pertaining to the same 'word'. For example: you may want to treat delimit, delimits, delimited, delimiter, delimiters...as the same. This process is called 'stemming' and there are a number of well-researched algorithms, with examples from a variety of programming languages, that will attempt to do this for you. Good luck! Wish I could be more help. – soundslikeneon Mar 31 '12 at 06:38
    Actually, I'd recommend pretty much the opposite. If the intent is purely to "remove" pre-defined "common" words, I'd just create the full list, complete with all the variations of each word that you want removed. Don't worry about which words are plurals of which. Much simpler/faster/less error-prone. – Hot Licks Apr 07 '12 at 23:33

3 Answers


I'd just do something like this:

from nltk.corpus import stopwords
s = set(stopwords.words('english'))

txt = "a long string of text about him and her"
print(list(filter(lambda w: w not in s, txt.split())))

which prints

['long', 'string', 'text']

and in terms of complexity it should be O(n) in the number of words in the string, assuming the hashed set lookup is O(1).

FWIW, my version of NLTK defines 127 stopwords:

'all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had', 'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during', 'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing', 'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't', 'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own', 'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too', 'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those', 'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can', 'theirs', 'my', 'and', 'then', 'is', 'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same', 'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so', 'the', 'having', 'once'

Obviously you can provide your own set; I agree with the comment on your question that it's probably easiest (and fastest) to just provide all the variations you want to eliminate up front, unless you want to eliminate a lot more words than this. At that point it becomes more a question of spotting interesting words than of eliminating spurious ones.
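If it helps, here's a minimal sketch of extending the NLTK set with your own words (the extra forms below are just illustrative, not a recommended list):

from nltk.corpus import stopwords

extra = {"there's", "theres", "they're"}          # hypothetical extra forms to strip
s = set(stopwords.words('english')) | extra

txt = "there's a long string of text about him and her"
print([w for w in txt.split() if w not in s])     # -> ['long', 'string', 'text']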

timday

Your problem domain is "Natural Language Processing".

If you don't want to reinvent the wheel, use NLTK and search for stemming in the docs.

Given that NLP is one of the hardest subjects in computer science, reinventing this wheel is a lot of work...
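
For a rough idea of what stemming buys you, here's a minimal sketch using NLTK's PorterStemmer (the example words come from the comment under the question; the exact stems are whatever Porter produces):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ['delimit', 'delimits', 'delimited', 'delimiters']:
    print(w, '->', stemmer.stem(w))   # all of these should collapse to the same stem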

Paulo Scardine

You ask about speed, but you should be more concerned with accuracy. Both your suggestions will make a lot of mistakes, removing either too much or too little (for example, there are a lot of words that contain the substring "at"). I second the suggestion to look into the nltk module. In fact, one of the early examples in the NLTK book involves removing common words until the most common remaining ones reveal something about the genre. You'll get not only tools, but instruction on how to go about it.

Anyway, you'll spend much longer writing your program than your computer will spend executing it, so focus on doing it well.
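
To make that concrete, here's a minimal sketch combining the Counter approach from your question with NLTK's stopword list (recipe_str is assumed to be your recipe text):

from collections import Counter
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
word_freq = Counter(w for w in recipe_str.lower().split() if w not in stop)
print(word_freq.most_common(10))   # the ten most frequent non-stopword tokens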

alexis