I am trying to process the visible text of very large pages (for example, the whole of Orwell's "1984" on this page), but my Chrome console seems to crash when I try the following operation.

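// Attach jQuery to the page (the script loads asynchronously, so let it finish before using $).
var script = document.createElement('script');
script.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js";
document.getElementsByTagName('head')[0].appendChild(script);
// Collect the visible text and split it into words.
var allWords = $(document.body).children(":visible").text().split(' ');
// Keep only the first occurrence of each word (this is the step that hangs).
var uniqueWords = allWords.filter(function(elem, i, array){ return array.indexOf(elem) === i });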

The above makes my Chrome tab become unresponsive at the last operation (I stop getting output for new commands I enter for at least a minute). Note: the first part of the snippet just attaches jQuery to the page.

How would you process large strings like this much, much faster? Should I randomly sample from allWords and apply the filter function only to that smaller array?

under_the_sea_salad

1 Answer

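The Chrome tab hangs on the last line because of the complexity of your algorithm: .indexOf scans the array from the beginning for every word, so the filter takes time roughly quadratic in the number of words. Instead of calling .indexOf on each word, you can just add each word to a Set, which stores each value only once: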

var uniqueWords = new Set();
allWords.forEach(function (word) { 
    uniqueWords.add(word) 
});
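
To make the difference concrete, here is a minimal sketch that puts the two approaches side by side. The function names and the sample sentence are made up for illustration and stand in for allWords; Array.from turns the Set back into a plain array.

```javascript
// Hypothetical helper names; sampleWords stands in for allWords.
function uniqueWithIndexOf(words) {
    // Quadratic: indexOf rescans the array from the start for every element.
    return words.filter(function (word, i, array) {
        return array.indexOf(word) === i;
    });
}

function uniqueWithSet(words) {
    // A Set stores each value once and iterates in insertion order.
    return Array.from(new Set(words));
}

var sampleWords = 'war is peace freedom is slavery ignorance is strength'.split(' ');
console.log(uniqueWithSet(sampleWords));
```

Both functions return the same words in the same order; only the amount of work per word differs, which is what matters on a book-length input.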

If you need an ES5 version of the same code, you can use a helper object as data storage. Object keys are unique by nature, so you can fill an empty object with the words as keys (with whatever you like as values) and then extract the words with the Object.keys method:

var uniqueWordsHash = {};
allWords.reduce(function (hash, word) {
    hash[word] = null;
    return hash;
}, uniqueWordsHash);

var uniqueWordsArray = Object.keys(uniqueWordsHash);
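
One possible refinement, as a sketch rather than a required change: a plain {} inherits from Object.prototype, so a word such as "__proto__" would not be stored as an ordinary key. Assuming that edge case matters for your text, an object created with Object.create(null) avoids it. The function name uniqueWordsES5 is hypothetical.

```javascript
// Hypothetical ES5-only helper; uses a prototype-less object as the lookup table.
function uniqueWordsES5(words) {
    var seen = Object.create(null); // no inherited keys such as __proto__
    var result = [];
    for (var i = 0; i < words.length; i++) {
        var word = words[i];
        if (!(word in seen)) {
            seen[word] = true;  // mark the word as already collected
            result.push(word);  // keep first-occurrence order
        }
    }
    return result;
}

console.log(uniqueWordsES5(['a', 'b', 'a', '__proto__', 'b']));
```

Collecting the words in a separate array also means the output order does not depend on how the engine orders object keys.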
Andrei Lesnitsky
  • That's excellent. I should add that I needed to do something like `Array.from(uniqueWords)` to actually get back an array which is what I was after. – under_the_sea_salad Dec 26 '15 at 23:57