Remove excess junk words from string or array of strings

Question

I have millions of arrays that each contain about five strings. I am trying to remove all of the "junk words" (for lack of a better description) from the arrays: articles and other common words such as "to", "and", "or", "the", "a" and so on.

For example, one of my arrays has these six strings:

"14000"
"Things"
"to"
"Be"
"Happy"
"About"

I want to remove the "to" from the array.

One solution is to do:

excess_words = ["to","and","or","the","a"]
cleaned_array = dirty_array.reject {|term| excess_words.include? term}

But I am hoping to avoid manually typing every excess word. Does anyone know of a Rails function or helper that would help in this process? Or perhaps an array of "junk words" already written?

score 4 · Accepted Answer · edited May 23 '17 at 12:17

Dealing with stopwords is easy, but I'd suggest you do it BEFORE you split the string into the component words.

Building a fairly simple regular expression can make short work of the words:

STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i

clean_string = 'to into and sandbar or forest the thesis a algebra'.gsub(STOPWORDS, '')
# => " into  sandbar  forest  thesis  algebra"

clean_string.split
# => ["into", "sandbar", "forest", "thesis", "algebra"]
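If a cleaned-up string is wanted rather than an array, the leftover double spaces can be collapsed instead of splitting. A minimal sketch, assuming the same short word list as above; the variable names are only for illustration:

```ruby
# Same pattern as above, built from a short illustrative word list.
stopword_pattern = /\b(?:#{ %w[to and or the a].join('|') })\b/i

text = 'to into and sandbar or forest the thesis a algebra'

# gsub leaves the surrounding spaces behind; squeeze(' ') collapses runs
# of spaces and strip trims the leading and trailing ones.
clean_string = text.gsub(stopword_pattern, '').squeeze(' ').strip
# => "into sandbar forest thesis algebra"
```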

How do you handle them if you get them already split? I'd join(' ') the array to turn it back into a string, then run the above code, which returns the array again.

incoming_array = [
  "14000",
  "Things",
  "to",
  "Be",
  "Happy",
  "About",
]

STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i

incoming_array = incoming_array.join(' ').gsub(STOPWORDS, '').split
# => ["14000", "Things", "Be", "Happy", "About"]

You could try to use Array's set operations, but you'll run afoul of the case sensitivity of the words, forcing you to iterate over both the stopwords and the arrays, which will run a LOT slower.
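To make the case-sensitivity problem concrete, here is a small sketch. The word lists are made up for illustration, and the `Set` lookup with `downcase` is just one possible workaround, shown without any claim about how its speed compares to the regex:

```ruby
require 'set'

words     = %w[14000 Things To Be Happy About]
stopwords = %w[to and or the a]

# Array difference compares strings exactly, so the capitalized "To"
# is not removed.
plain_difference = words - stopwords
# => ["14000", "Things", "To", "Be", "Happy", "About"]

# One workaround: normalize each word before looking it up in a Set.
stopword_set     = Set.new(stopwords)
case_insensitive = words.reject { |word| stopword_set.include?(word.downcase) }
# => ["14000", "Things", "Be", "Happy", "About"]
```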

Take a look at these two answers for some added tips on how you can build very powerful patterns making it easy to match thousands of strings:

Interesting, I had not considered that a regex would be faster. I will use that method instead. Thanks! – Mike S, Jan 07 '15 at 18:51
Regexes aren't faster for simple string lookups, but when you're matching against a number of strings at once, patterns suddenly have an advantage. There can be gotchas in more complex situations, but this is pretty straightforward. I used to do exactly this sort of thing a lot in Perl, and found it was a lot faster. – the Tin Man, Jan 07 '15 at 20:04
See the added links in the answer about generating and using complex patterns. – the Tin Man, Jan 07 '15 at 20:08

score 2 · Answer 2 · answered Jan 07 '15 at 17:31

All you need is a list of English stopwords. You can find it here, or google for 'english stopwords list'
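Once you have such a list, it can be turned into a pattern like the one in the accepted answer. A rough sketch, assuming the list is plain text with one word per line; the heredoc below is a tiny stand-in for a real list, and the variable names are hypothetical:

```ruby
# Stand-in for the contents of a downloaded stopword list
# (a real list would be much longer).
stopword_text = <<~LIST
  a
  and
  or
  the
  to
LIST

stopwords = stopword_text.split.map(&:downcase).uniq

# Regexp.union escapes each string and joins them with "|".
pattern = /\b(?:#{Regexp.union(stopwords).source})\b/i

cleaned = %w[14000 Things to Be Happy About].join(' ').gsub(pattern, '').split
# => ["14000", "Things", "Be", "Happy", "About"]
```

With a real file, `File.readlines(path, chomp: true)` could replace the heredoc.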

Grych


Perfect! I was not aware of the term "stopwords", thanks a lot – Mike S Jan 07 '15 at 17:34
