
I am curious to know if one could search the entire DOM with a regex and then identify the path used to get to each matching node. In other words, I want to find all matches of a pattern, let's say the word "hello", and from each match identify its branch within the DOM, or at least its parent container.

Applying the regex obviously finds the matches, but it does not keep the context of where they were found within the DOM. Is there a way to override this matching to also print or associate the location of each match? If not (assuming a regex will not parse the DOM tree in the same fashion), are there any suggestions for achieving the desired result?

user1257332
    Regexps will not help here. How do you know whether the "hello" is in an element, an attribute, a comment, or whether it's an element name itself, without a full-fledged HTML parser? In this case, anything you'd want to do with a regex, you'd be able to use `indexOf` for. Anything past that, I wouldn't trust a regex to handle. – cHao Mar 08 '12 at 15:28
    (Insert obligatory link to http://stackoverflow.com/q/1732348/319403 here.) – cHao Mar 08 '12 at 15:38

2 Answers


You can go through the document, or some parent element, and examine each text node, returning an array of the nodes that have data that matches your search text.

This gives you an array of the actual nodes that matched, in case you want to manipulate them somehow.

Or, if you just want to read the paths to each match, you can return the paths instead of the nodes.

This example uses three functions: one to recurse the tree looking for text nodes, one to trace a node's descent from the root, and one to match the text and return the path to its node as a string. The first two are reusable; the third is a one-off.

document.deepText= function(node, fun){
    var A= [], tem;
    fun= fun || function(n){
        return n
    };
    if(node){
        node= node.firstChild;
        while(node!= null){
            if(node.nodeType== 3){
                tem= fun(node);
                if(tem!= undefined) A[A.length]= tem;
            }
            else A= A.concat(document.deepText(node, fun));
            node= node.nextSibling;
        }
    }
    return A;
}

//returns an array of the node's ancestors, ordered from pa down to the node itself

document.descent= function(node, pa){
    var A= [];
    pa= pa || document.documentElement;
    while(node){
        A[A.length]= node;
        if(node== pa) return A.reverse();
        node= node.parentNode;
    }
}

//This one returns a string listing the 'paths' to each matching node, one per line
//almost all of it is spent making strings for the paths
//pass it a regexp or a string (a string is matched as a whole word;
//regex metacharacters in it are not escaped)

function pathstoText(rx, pa){
    pa= pa || document.body;
    if(!(rx instanceof RegExp)) rx= RegExp('\\b'+rx+'\\b', 'g');
    var matches= document.deepText(pa, function(itm){
        if(rx.test(itm.data)){
            return document.descent(itm).map(function(who){
                if(who.nodeType== 3) return '="'+who.data.match(rx)+'"';
                var n= 1, sib= who.previousSibling, tag= who.tagName;
                if(who.id) return tag+'#'+who.id;
                else{
                    while(sib){
                        if(sib.tagName=== tag)++n;
                        sib= sib.previousSibling;
                    }
                    if(n== 1) n= '';
                    else n= '#'+n;
                    return who.tagName+n;
                }
            }).join('> ');
        }
    });
    return matches.join('\n');
}

//A couple of examples

pathstoText('Help') //finds 'Help' on a button

HTML> BODY> DIV#evalBlock> DIV#evalBar> BUTTON#button_009> ="Help"

pathstoText(/\bcamp[\w]*/ig)

finds 'Camp', 'camping', etc. on a page:
in the 2nd paragraph of div#page3,
two instances in the fifth paragraph of div#page6,
and so on.

HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page3> P#2> ="Camp"
HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page3> P#4> ="camp"
HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page3> P#12> ="camping"
HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page4> P#3> ="camp"
HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page4> P#7> ="camp"
HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page5> P#3> ="Camp"
HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page5> P#5> ="camp"
HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page5> P#7> ="camp"
HTML> BODY> DIV#bookview> DIV#pagespread> DIV#page6> P#5> ="camp,camp"

//oh yeah, a fallback for old browsers that lack Array.prototype.map

if(!Array.prototype.map){
    Array.prototype.map= function(fun, scope){
        var T= this, L= T.length, A= Array(L), i= 0;
        if(typeof fun== 'function'){
            while(i< L){
                if(i in T){
                    A[i]= fun.call(scope, T[i], i, T);
                }
                ++i;
            }
            return A;
        }
    }
}
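
For comparison, here is a compact sketch of the same idea in more recent JavaScript. It is not the code above rewritten line for line: the names `pathTo`, `findText`, `el`, and `text` are made up for illustration, and the small stand-in tree exists only so the sketch can run outside a browser. In a page you would presumably pass `document.body` as the root instead.

```javascript
// Build a path such as "BODY> DIV#main> P#2" for an element, stopping at root.
// Elements with an id use it; others get their 1-based position among
// same-tag siblings (omitted when it is 1), as in the output format above.
function pathTo(node, root) {
    const parts = [];
    for (let n = node; n; n = n.parentNode) {
        if (n.nodeType === 1) {
            let label = n.tagName;
            if (n.id) {
                label += '#' + n.id;
            } else if (n.parentNode) {
                const sameTag = Array.from(n.parentNode.childNodes)
                    .filter(s => s.nodeType === 1 && s.tagName === n.tagName);
                const pos = sameTag.indexOf(n) + 1;
                if (pos > 1) label += '#' + pos;
            }
            parts.unshift(label);
        }
        if (n === root) break;
    }
    return parts.join('> ');
}

// Walk every text node under root and record the first match in each one.
function findText(root, pattern) {
    // Copy the regex without the g/y flags so exec() carries no lastIndex state.
    const rx = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ''));
    const hits = [];
    (function walk(node) {
        for (const child of Array.from(node.childNodes || [])) {
            if (child.nodeType === 3) {
                const m = rx.exec(child.data);
                if (m) {
                    hits.push({ node: child, match: m[0], path: pathTo(child.parentNode, root) });
                }
            } else {
                walk(child);
            }
        }
    })(root);
    return hits;
}

// Stand-in nodes (hypothetical helpers) mimicking the few DOM properties used.
function el(tagName, id, children) {
    const node = { nodeType: 1, tagName, id, childNodes: children, parentNode: null };
    children.forEach(c => { c.parentNode = node; });
    return node;
}
const text = data => ({ nodeType: 3, data, childNodes: [], parentNode: null });

const body = el('BODY', '', [
    el('DIV', 'main', [
        el('P', '', [text('nothing here')]),
        el('P', '', [text('say hello to the DOM')]),
    ]),
]);

const hits = findText(body, /\bhello\b/i);
console.log(hits.map(h => h.path + '> ="' + h.match + '"').join('\n'));
// prints: BODY> DIV#main> P#2> ="hello"
```

Each hit keeps the text node itself alongside the path string, so you can either print the paths or go on to manipulate the nodes.
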
kennebec

I am curious to know if one could search the entire DOM with a regex that could then essentially identify the path used to get to the matching node.

Well, it's theoretically possible, but very painful (read: you don't want to do that). You are much better off using a parser to parse HTML.
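
To illustrate the pain, here is a small sketch (the markup string is invented for the example) of what a regex sees when run against serialized HTML. It reports character offsets only, with nothing to say whether a match sits in an attribute, a comment, or actual text:

```javascript
// Hypothetical markup: "hello" appears in an attribute, a comment, and a text node.
const html = '<div title="hello"><!-- hello --><p>hello</p></div>';

// A regex over the raw string treats all three occurrences alike.
const offsets = Array.from(html.matchAll(/hello/g), m => m.index);

console.log(offsets.length); // 3 matches, but only one of them is visible text
```

Telling those three cases apart means reconstructing the structure around each offset, which is the parser's job.
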

Qtax
  • Assuming this is to all be done in JavaScript, it sounds like things could get a little cumbersome. Time to journey to the books to identify something worthwhile to use. – user1257332 Mar 08 '12 at 16:24
  • @user1257332, just use the DOM available in JS and search the text nodes? – Qtax Mar 08 '12 at 16:26