How to get an Array of all words used on a page

Question

So I'm trying to get an array of all the words used in my web page.

Should be easy, right?

The problem I run into is that $("body").text().split(" ") returns an array where the words at the beginning of one element and end of another are joined as one.

For example:

<div id="1">Hello
    <div id="2">World</div>
</div>

returns ["HelloWorld"] when I want it to return ["Hello", "World"].

I also tried:

wordArr = [];

function getText(target)
{    
    if($(this).children())
    {
        $(this).children(function(){getText(this)});
    }
    else
    {
        var testArr = $(this).text().split(" ");
        for(var i =0; i < testArr.length; i++)
            wordArr.push(testArr[i]);
    }

}

getText("body");

but $(node).children() is truthy for any node in the DOM that exists, so that didn't work.

I'm sure I'm missing something obvious, so I'd appreciate an extra set of eyes.

For what it's worth, I don't need unique words, just every word in the body of the document as an element in the array. I'm trying to use it to generate context and lexical co-occurrence with another set of words, so duplicates just up the contextual importance of a given word.

Thanks in advance for any ideas.

See http://stackoverflow.com/questions/298750/how-do-i-select-text-nodes-with-jquery; it might be helpful to you. – PSR, Jun 03 '13 at 21:58

PSL · Accepted Answer · 2013-06-04T04:04:36.143

7

How about something like this?

var res = $('body *').contents().map(function () {
    if (this.nodeType == 3 && this.nodeValue.trim() != "") 
        return this.nodeValue.trim();
}).get().join(" ");
console.log(res);

Get the array of words:

var res = $('body *').contents().map(function () {
    if (this.nodeType == 3 && this.nodeValue.trim() != "") //check for nodetype text and ignore empty text nodes
        return this.nodeValue.trim().split(/\W+/);  //split the nodevalue to get words.
}).get(); //get the array of words.

console.log(res);
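
One caveat with splitting on `/\W+/`: punctuation at the start or end of a text node leaves empty strings in the result, and apostrophes or hyphens cut a word in two. A minimal sketch of a whitespace-only splitter follows; `splitWords` is a hypothetical helper name, and treating whitespace as the only separator is an assumption about what counts as a word here.

```javascript
// Hypothetical helper: split a text node's value into words on whitespace
// only, dropping the empty strings that split() can leave behind.
function splitWords(text) {
  return text.split(/\s+/).filter(function (word) {
    return word.length > 0;
  });
}

console.log("Hello, World!".split(/\W+/));     // [ 'Hello', 'World', '' ]
console.log(splitWords("  Hello,   World! ")); // [ 'Hello,', 'World!' ]
```

Inside the `.map` callback, `return splitWords(this.nodeValue);` could take the place of the `trim().split(/\W+/)` call. Punctuation then stays attached to the words, so it would need separate handling if that matters.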

edited Jun 04 '13 at 04:04

answered Jun 03 '13 at 21:59

PSL


this won't work for non-text nodes (`nodeType != 3`), such as `missing text` – yonilevy Jun 03 '13 at 22:06
@yonilevy: `missing text` is a text node. – Felix Kling Jun 03 '13 at 22:10
Each text node could still contain multiple words. You probably want to split each node on white spaces. Fortunately, `.map` flattens returned arrays into the final array, so all you really have to do is split. I don't necessarily like selecting all elements, but the alternative would be some recursion, which could nest deeply. – Felix Kling Jun 03 '13 at 22:11
@PSL, that worked great. Just had to add a :not("script") to keep the non-HTML out. Thanks! – Jason Nichols Jun 04 '13 at 14:55
@JasonNichols oh yes if you have script on body.. :) – PSL Jun 04 '13 at 15:03

adeneo · Answer 2 · 2013-06-03T22:25:44.960

3

function getText(target) {
    var wordArr = [];
    $('*',target).add(target).each(function(k,v) {
        var words  = $('*',v.cloneNode(true)).remove().end().text().split(/(\s+|\n)/);
        wordArr = wordArr.concat(words.filter(function(n){return n.trim()}));
    });
    return wordArr;
}

edited Jun 03 '13 at 22:25

answered Jun 03 '13 at 22:12

adeneo


What about words that are separated by elements, e.g. `Fr<b>e</b>d`? – RobG Jun 03 '13 at 22:54
@RobG - didn't the `<b>` tag die already? That's what CSS is for! – adeneo Jun 03 '13 at 22:58
Adeneo—CSS will not save you: `Fred`. – RobG Jun 04 '13 at 00:09
This also worked. The main problem I was having was with elements separated by form elements, br, and p tags concatenating. @RobG you raise an interesting point. Will look at your answer next. – Jason Nichols Jun 04 '13 at 14:59

score 1 · Answer 3 · answered Jun 03 '13 at 23:02

1

You can do this:

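// Global accumulator that getwords() appends to.
var words = [];

function getwords(e){
    e.contents().each(function(){
        if ( $(this).children().length > 0 ) {
            // Element with child elements: recurse into it.
            getwords($(this));
        }
        else if($.trim($(this).text())!=""){
            // Text node or leaf element: split its text into words.
            words=words.concat($.trim($(this).text()).split(/\W+/));
        }
    });
}

// Example call (an assumption): collect words from the whole page.
getwords($('body'));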

http://jsfiddle.net/R55eM/

answered Jun 03 '13 at 23:02

Abraham Uribe


Great Answer. Still reading all the answers before I accept one, but this worked nicely, and I thought it was aesthetically very nice code too. Thanks! – Jason Nichols Jun 04 '13 at 14:32

RobG · Answer 4 · 2013-06-04T22:12:53.393

The question assumes that words are not internally separated by elements. If you simply create an array of words separated by white space and elements, you will end up with:

Fr<b>e</b>d

being read as

['Fr', 'e', 'd'];
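
One way to keep such words whole is to concatenate text across inline elements and break only at other element boundaries. A rough sketch, assuming a hand-picked list of inline tag names: `INLINE_TAGS`, `collectText` and `getWordsInlineAware` are hypothetical names, the list is not exhaustive, and CSS can change whether an element actually renders inline.

```javascript
// Assumption: these tags never separate words. The list is illustrative only.
var INLINE_TAGS = { b: true, i: true, em: true, strong: true, span: true, a: true };

// Collect the text under `node`, padding with spaces around any element
// that is not in INLINE_TAGS so that those boundaries still separate words.
function collectText(node) {
  if (node.nodeType === 3) {   // text node
    return node.data;
  }
  if (node.nodeType !== 1) {   // ignore comments and other node types
    return '';
  }
  var text = '';
  for (var i = 0; i < node.childNodes.length; i++) {
    text += collectText(node.childNodes[i]);
  }
  var isInline = INLINE_TAGS.hasOwnProperty(node.tagName.toLowerCase());
  return isInline ? text : ' ' + text + ' ';
}

function getWordsInlineAware(root) {
  var text = collectText(root).replace(/^\s+|\s+$/g, '');
  return text.length ? text.split(/\s+/) : [];
}
```

In a browser this could be called as `getWordsInlineAware(document.body)`. With the markup above inside a block element, the text collapses to `Fred` before splitting.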

Another thing to consider is punctuation. How do you deal with: "There were three of them: Mark, Sue and Tom. They were un-remarkable. One—the red head—was in the middle." Do you remove all punctuation? Or replace it with white space before trimming? How do you re-join words that are split by markup or characters that might be inter–word or intra–word punctuation? Note that while it is popular to write a dash between words with a space at either side, "correct" punctuation uses an m dash with no spaces.

Not so simple…

Anyhow, an approach that just splits on spaces and elements using recursion and works in any browser in use without any library support is:

function getWords(element) {
  element = element || document.body;
  var node, nodes = element.childNodes;
  var words = [];
  var text, i=0;

  while (node = nodes[i++]) {

    if (node.nodeType == 1) {
      words = words.concat(getWords(node));

    } else if (node.nodeType == 3) {
      text = node.data.replace(/^\s+|\s+$/g,'').replace(/\s+/g,' ');
      words = !text.length? words : words.concat(text.split(/\s/));
    }
  }
  return words;
}

but it does not deal with the issues above.
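
For the punctuation question, one possible policy is to treat em and en dashes as word separators, keep hyphens and apostrophes that sit inside a word, and strip everything else from the edges of each word. A sketch of that policy, with `tokenize` as a hypothetical name; it only recognises unaccented Latin letters and digits, which is an assumption that will not suit every page.

```javascript
// Hypothetical tokenizer: one policy among many, not a general solution.
function tokenize(text) {
  return text
    .replace(/[\u2013\u2014]/g, ' ')   // en and em dashes separate words
    .split(/\s+/)
    .map(function (token) {
      // Strip leading/trailing characters that are not letters or digits.
      return token.replace(/^[^A-Za-z0-9]+|[^A-Za-z0-9]+$/g, '');
    })
    .filter(function (token) {
      return token.length > 0;
    });
}

console.log(tokenize('They were un-remarkable. One\u2014the red head\u2014was in the middle.'));
// [ 'They', 'were', 'un-remarkable', 'One', 'the', 'red', 'head', 'was', 'in', 'the', 'middle' ]
```

The output of `getWords` could be joined with spaces and passed through such a function, or each text node could be tokenized directly in place of the `split(/\s/)` call.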

Edit

To avoid script elements, change:

    if (node.nodeType == 1) {

to

    if (node.nodeType == 1 && node.tagName.toLowerCase() != 'script') {

Any element that should be avoided can be added to the condition. If a number of element types should be avoided, you can do:

var elementsToAvoid = {script:'script', button:'button'};
...
    if (node.nodeType == 1 && node.tagName && !(node.tagName.toLowerCase() in elementsToAvoid)) {
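
Put together, the skip-list variant might look like the sketch below. `getWordsSkipping` and its `skip` parameter are hypothetical names, and unlike `getWords` it expects the root node to be passed explicitly.

```javascript
// Sketch: same recursion as getWords, but elements whose lower-cased tag
// name is a key of `skip` are not descended into.
function getWordsSkipping(element, skip) {
  var words = [];
  var nodes = element.childNodes;
  for (var i = 0; i < nodes.length; i++) {
    var node = nodes[i];
    if (node.nodeType == 1) {
      if (!(node.tagName.toLowerCase() in skip)) {
        words = words.concat(getWordsSkipping(node, skip));
      }
    } else if (node.nodeType == 3) {
      // Trim the text node and split on runs of white space.
      var text = node.data.replace(/^\s+|\s+$/g, '');
      if (text.length) {
        words = words.concat(text.split(/\s+/));
      }
    }
  }
  return words;
}
```

A call such as `getWordsSkipping(document.body, {script: true, style: true, button: true})` would then leave out script, style and button content.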

This approach also works, but without jQuery it would be harder to filter out script tags (like Google Analytics, which is traditionally in the body). I just nested everything inside the while loop in an `if($(this).parent().is(":not('script')"))` statement, and it ran fine. Thanks! – Jason Nichols, Jun 04 '13 at 15:09
Any ideas on how to address the issues you raised? Any one of these solutions is sufficient for my current needs, but I'm definitely interested in any feedback on punctuation. I'm less concerned about words where part of the word is styled; they should be infrequent enough not to affect the statistical analysis. – Jason Nichols, Jun 04 '13 at 15:11
