How to remove ALL stop words from text?

Question

I'm trying to use this JavaScript code:

var aStopWords = new Array ("a", "the", "blah"...);

(code to make it run, full code can be found here: https://jsfiddle.net/j2kbpdjr/)

// sText is the body of text that the keywords are being extracted from. 
// It's being separated into an array of words.

// remove stop words
for (var m = 0; m < aStopWords.length; m++) {
    sText = sText.replace(' ' + aStopWords[m] + ' ', ' ');
}

to get the keywords from a body of text. It works quite well; however, it only seems to ignore one instance of each word in the array aStopWords.

So if I have the following body of text:

how are you today? Are you well?

And I set var aStopWords = new Array("are", "well"), then it ignores the first instance of are but still shows the second are as a keyword, whereas well is removed from the keywords completely.

If anyone can help ignore all instances of the words in aStopWords from the keywords, I'd greatly appreciate it.

Is your goal to remove every occurrence of a list of words from a text? — ssc-hrep3, Feb 13 '17 at 11:37
Duplicate of: http://stackoverflow.com/questions/21493028/javascript-replace-more-than-one-value — T.J. Crowder, Feb 13 '17 at 11:54
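
A note on the behavior described in the question: when the first argument to replace() is a plain string, only the first match is substituted, which would explain why a second occurrence survives. One possible fix, shown as a minimal sketch, is to build a regular expression with the g (global) and i (ignore case) flags for each stop word. The function name removeStopWords is hypothetical, and the sketch assumes the stop words contain no regex metacharacters.

```javascript
// Hypothetical helper (not from the original post): removes every
// occurrence of each stop word, ignoring case.
function removeStopWords(sText, aStopWords) {
  for (var m = 0; m < aStopWords.length; m++) {
    // \b matches a word boundary; "g" replaces all matches, "i" ignores case.
    // Assumption: stop words are plain words without regex metacharacters.
    var pattern = new RegExp("\\b" + aStopWords[m] + "\\b", "gi");
    sText = sText.replace(pattern, "");
  }
  // Collapse the whitespace runs left behind by the removals.
  return sText.replace(/\s+/g, " ").trim();
}

// logs "how you today? you ?"
console.log(removeStopWords("how are you today? Are you well?", ["are", "well"]));
```

Punctuation is left in place here, so this only addresses the "first instance only" problem, not keyword extraction as a whole.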

ssc-hrep3 · Accepted Answer · 2017-02-13T12:45:23.743

You can do this by counting the keywords instead of replacing them in the string.

First, split the text into words. Then go through all the words and check whether each one is a stop word. If it is, ignore it. If not, increase the occurrence counter of that keyword in the result object.

Then, the keywords are in a JavaScript object in the following form:

{ "this": 1, "that": 2 }

Objects are not sortable in JavaScript, but Arrays are. So, a remapping to the following structure is necessary:

[
    { "keyword": "this", "counter": 1 },
    { "keyword": "that", "counter": 2 }
]

Then, the array can be sorted by using the counter attribute. With the slice() function, only the top X values can be extracted from the sorted list.

var stopwords = ["about", "all", "alone", "also", "am", "and", "as", "at", "because", "before", "beside", "besides", "between", "but", "by", "etc", "for", "i", "of", "on", "other", "others", "so", "than", "that", "though", "to", "too", "through", "until"];
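// textContent returns only the text, without any nested markup
var text = document.getElementById("main").textContent;

// split on whitespace and common punctuation
var keywords = text.split(/[\s.,;:!?"]+/);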
var keywordsAndCounter = {};
for(var i=0; i<keywords.length; i++) {
  var keyword = keywords[i];
  
  // keyword is not a stopword and not empty
  if(stopwords.indexOf(keyword.toLowerCase()) === -1 && keyword !== "") {
    if(!keywordsAndCounter[keyword]) {
      keywordsAndCounter[keyword] = 0;
    }
    keywordsAndCounter[keyword]++;
  }
}

// remap from { keyword: counter, keyword2: counter2, ... } to [{ "keyword": keyword, "counter": counter }, {...} ] to make it sortable
var result = [];
var nonStopKeywords = Object.keys(keywordsAndCounter);
for(var i=0; i<nonStopKeywords.length; i++) {
  var keyword = nonStopKeywords[i];
  result.push({ "keyword": keyword, "counter": keywordsAndCounter[keyword] });
}

// sort the values according to the number of the counter
result.sort(function(a, b) {
  return b.counter - a.counter;
});

var topFive = result.slice(0, 5);
console.log(topFive);

<div id="main">This is a test to show that it is all about being between others. I am there until 8 pm event though it will be late. Because it is "cold" outside even though it is besides me.</div>
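
The steps above can also be wrapped in one reusable function that returns only the top n keywords. This is a sketch under a few assumptions, not part of the original answer: the name topKeywords and its parameters are hypothetical, the split pattern also treats commas, question marks and exclamation marks as separators, and, unlike the code above, keywords are lowercased before counting so that "This" and "this" share one counter. A Map is used for the counters so that words such as "constructor" do not collide with built-in object properties.

```javascript
// Hypothetical helper: returns the n most frequent non-stop words
// as [{ keyword, counter }, ...], sorted by counter in descending order.
function topKeywords(text, stopwords, n) {
  var stopSet = new Set(stopwords.map(function (w) { return w.toLowerCase(); }));
  var counts = new Map();

  // Assumption: words are separated by whitespace or common punctuation.
  text.split(/[\s.,;:!?"]+/).forEach(function (word) {
    var key = word.toLowerCase();
    // Skip empty fragments and stop words.
    if (key !== "" && !stopSet.has(key)) {
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  });

  // Convert the Map entries to an array so they can be sorted.
  return Array.from(counts, function (entry) {
    return { keyword: entry[0], counter: entry[1] };
  }).sort(function (a, b) {
    return b.counter - a.counter;
  }).slice(0, n);
}

console.log(topKeywords("The cat and the dog. The cat!", ["and"], 2));
```

In the browser it could be called as topKeywords(document.getElementById("main").textContent, stopwords, 5).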

Thank you! That works perfectly for removing all instances of the stop words, there's one issue I'm having with this (sorry to be a pain). The issue is that this is listing ALL of the non-stop words, rather than only the top X number of reoccurring words. — Jack, Feb 13 '17 at 12:24
@Jack, I've updated the answer with the following: The issue is, that an object cannot be sorted, so you need to convert it from an object to an array (containing objects). — ssc-hrep3, Feb 13 '17 at 12:48
