1

I am looking for a JavaScript regex that searches for text ("span", for example) in HTML.

Example:

<div>Lorem span Ipsum dor<a href="blabla">lablala</a> dsad <span>2</span> ... </div>

But only the "span" after "Lorem" should be matched, not the <span> tag.
For a second example, if we search for "bla", only the "bla" in the link text "lablala" should be matched, not the occurrences inside the href attribute.

EDIT:

The HTML is read from innerHTML. Each match will be wrapped in <span class="x">$text</span>, and the result is then written back to the innerHTML of the same node, all without breaking the other tags.

EDIT2 and My Solution:

I wrote my own search. It scans the HTML character by character, using a cache and flags.
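
For readers wondering what such a scanner might look like, here is a minimal sketch. It is not the asker's actual code (which was never posted): the function name highlightText, its parameters, and the two flags are assumptions made for illustration. It tracks whether the current character is inside a tag or a quoted attribute value and only wraps matches found outside tags. It does not handle comments, <script> or <style> content.

```javascript
// Hypothetical helper, not the asker's code: scan char by char with two flags.
function highlightText(html, word, className) {
  var out = "";
  var inTag = false; // flag: between "<" and the closing ">"
  var quote = "";    // flag: quote character of the attribute value we are in, if any
  var needle = word.toLowerCase();
  var i = 0;
  while (i < html.length) {
    var ch = html.charAt(i);
    if (inTag) {
      if (quote) {
        if (ch === quote) quote = "";        // end of the quoted attribute value
      } else if (ch === '"' || ch === "'") {
        quote = ch;                          // start of a quoted attribute value
      } else if (ch === ">") {
        inTag = false;                       // tag is closed
      }
      out += ch;
      i++;
    } else if (ch === "<") {
      inTag = true;
      out += ch;
      i++;
    } else if (needle.length > 0 &&
               html.slice(i, i + needle.length).toLowerCase() === needle) {
      // Text outside any tag matches: wrap it and skip past the match.
      out += '<span class="' + className + '">' +
             html.slice(i, i + needle.length) + "</span>";
      i += needle.length;
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}
```

Called as highlightText(node.innerHTML, "bla", "x"), the result could be assigned back to node.innerHTML.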

Thanks for your help, guys!

Bebna

8 Answers

2

You could use dom methods to process every text node.

This method takes a parent node as the first argument and loops through all of its child nodes, processing the text nodes with the function passed as the second argument. That function is where you operate on the text node's data, for example to find, replace, or delete the found text, or to wrap it in a 'highlighted' span.

You can also call the function with only the first argument. It then returns an array of text nodes, which you can use to manipulate the text. Each array item in that case is a node, with data, a parent, and siblings.

document.deepText= function(hoo, fun){
    var A= [], tem;
    if(hoo){
        hoo= hoo.firstChild;
        while(hoo!= null){
            if(hoo.nodeType== 3){
                if(fun){
                    if((tem= fun(hoo))!== undefined){
                       A[A.length]= tem;
                    }
                }
                else A[A.length]= hoo;
            }
            else A= A.concat(document.deepText(hoo, fun)); // recurse by name; arguments.callee throws in strict mode
            hoo= hoo.nextSibling;
        }
    }
    return A;
}

//test case

function ucwords(pa, rx){
    var f= function(node){
        var t= node.data;
        if(t && t.search(rx)!=-1){
            node.data= t.replace(rx,function(w){return w.toUpperCase()});
            return node;
        }
        return undefined;
    }
    return document.deepText(pa, f);
}

ucwords(document.body,/\bspan\b/ig)

kennebec
1

If you've got the HTML in a DOM element, you may use textContent/innerText to grab the text (without any HTML tags):

var getText = function(el) {
    return el.textContent || el.innerText;
};
// usage:
// <div id="myElement"><span>Lorem</span> ipsum <em>dolor</em></div>
alert(getText(document.getElementById('myElement'))); // "Lorem ipsum dolor"
moff
1
(?<!\<|/)span

This should give all span occurrences that are not tags. Hope this helped at least a bit :)

Explanation: find every span occurrence that is not preceded by < or /
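
A note for later readers: lookbehind assertions were added to JavaScript in ES2018, so the pattern above runs in current engines, although it did not when this thread was written. The sketch below tries it on a made-up sample string (the variable names and the sample are assumptions for illustration). It also shows the limitation raised in the comments: a match inside an attribute value is not preceded by < or /, so it still gets through.

```javascript
// Requires an engine with ES2018 lookbehind support.
var notAfterBracket = /(?<![<\/])span/g;

// Made-up sample: "span" appears in an attribute, in text, and as a tag name.
var html = '<div class="span">Lorem span Ipsum <span>2</span></div>';

// Mark every match with square brackets so the hits are easy to see.
var marked = html.replace(notAfterBracket, "[$&]");
// marked === '<div class="[span]">Lorem [span] Ipsum <span>2</span></div>'
```

The tag names are skipped, but the attribute value is matched, so this pattern alone is not enough for the original problem.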

Peter Perháč
  • sry but there is no lookbehind in js: http://www.regular-expressions.info/javascript.html and what is with "href" for example? – Bebna Apr 07 '09 at 13:59
  • then try changing approach. don't force javascript to solve problems it isn't designed to solve. whatever you're doing, try looking at the task at hand from a different perspective. – Peter Perháč Apr 07 '09 at 15:23
1

What you want to do can be done pretty easily with jQuery:

  $("span:contains('blah')")

If you want regular expression matching, do what was done in this previous Stack Overflow question:

jQuery Regular Expressions

For a more elegant solution, create a custom selector.

cgp
1
/span(?=[^>]*<)/

In other words, looking ahead from the end of the word "span" there is no closing angle bracket before the next opening angle bracket, so we can't be inside a tag. Supposedly, quoted attribute values can contain closing angle brackets, though I've never seen it done. But, to cover that possibility, you can use this regex:

/span(?=(?:[^>"']+|"[^"]*"|'[^']*')*<)/
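
As a rough illustration of how the simpler lookahead could be used for the highlighting described in the question, here is a small sketch. The helper names escapeRegExp and highlight are assumptions, not part of the answer. Note one consequence of the lookahead: text that has no "<" anywhere after it (for example trailing text at the very end of the string) is never matched.

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap every occurrence of `word` that is followed by
// a "<" before any ">" (i.e. that is not inside a tag) in <span class="x">.
function highlight(html, word) {
  var re = new RegExp(escapeRegExp(word) + "(?=[^>]*<)", "gi");
  return html.replace(re, '<span class="x">$&</span>');
}
```

With the HTML from the question, highlight(html, "bla") wraps only the "bla" inside the link text "lablala" and leaves the href attribute alone.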
Alan Moore
0

** New solution, using a negative lookahead

 var pageHTML ="  <span aa span > span asa span";
 var regex = "span(?!([^<]+)?>)";

This regex matches the word "span" only if it is not followed by a ">" that appears before the next "<". In other words, it rejects a "span" that sits inside a tag.
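
To see which occurrences that pattern picks out, the following sketch collects the index of every match in the sample string. The names pattern, positions, and m are assumptions made for illustration.

```javascript
var pageHTML = "  <span aa span > span asa span";
var pattern = new RegExp("span(?!([^<]+)?>)", "g");

// Collect the start index of each match.
var positions = [];
var m;
while ((m = pattern.exec(pageHTML)) !== null) {
  positions.push(m.index);
}
// positions → [18, 27]: only the two occurrences after the ">" are matched
```

The two occurrences between "<" and ">" are skipped because a ">" follows them before any "<".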

** The old solution

Here is my solution. I am looking for "asd", and if a match sits between an opening "<" and a closing ">", I ignore it.

I do that by looking to the left and to the right of the matched word. If I find that it is enclosed in a tag, I return the matched word unchanged (I don't replace it). If not, I replace it with the text I need.

    var pageHTML ="  < aa asd > asd < asd";
    var regex = "asd";
    var pattern = new RegExp(regex, "gi");
    var replaceWord = "dsa";

    //Replace all instances of word/words with our special spans
    pageHTML = pageHTML.replace(pattern, function(match, index, original){
        var leftIndex = index;  
        var rightIndex = index + match.length; // index is already a number

        var insideTag = false;
        var foundOpenTag = false;

        for(; leftIndex >= 0; leftIndex--){ // include index 0 so a "<" at the very start is seen
           if(pageHTML.charAt(leftIndex) == ">")
               break;
           if(pageHTML.charAt(leftIndex) == "<"){
                   foundOpenTag = true;
                   break;
               }
        }

        if(!foundOpenTag){
            return replaceWord;
        }

      for(; rightIndex < pageHTML.length ; rightIndex++){
           if(pageHTML.charAt(rightIndex) == "<")
               break;
           if(pageHTML.charAt(rightIndex) == ">" ){
                   insideTag = true;
                   break;
               }
        }
        if(insideTag)
            return match;
        else return replaceWord;


            });

alert(pageHTML);

0

If I understand you correctly, you want to search for a word, but only words which are not part of an HTML tag.

I don't have an exact answer for you, but some tools I use for developing regular expressions are this site: http://www.regular-expressions.info/ and this program: http://www.radsoftware.com.au/regexdesigner/

Brandon Montgomery
0

This might be impossible in the general case, because you would need to count opening and closing tags, which is not possible with regular expressions.

Regex is not a smart solution for handling XML. Instead you should use HTML or XML DOM methods to extract the required information.

If you really want or need to use regular expressions, you might try something like the following.

>[^<]*bla[^<]*<

But I am quite sure that this will not work in the general case.
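
One way to build on that pattern is to match each run of text between a ">" and the next "<", and then search for the word only inside that run. The sketch below does this with a replace callback. The function name highlightTextRuns is an assumption, and the same caveats apply: a ">" inside an attribute value will confuse it, and text before the first tag or after the last one is skipped.

```javascript
// Hypothetical helper: highlight `word` only inside text runs between ">" and "<".
function highlightTextRuns(html, word) {
  // Escape regex metacharacters in the search word.
  var wordRe = new RegExp(word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), "gi");
  return html.replace(/>[^<]+</g, function (run) {
    var text = run.slice(1, -1); // strip the surrounding ">" and "<"
    return ">" + text.replace(wordRe, '<span class="x">$&</span>') + "<";
  });
}
```

Unlike the single pattern above, this finds every occurrence within a text run rather than one per run.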

Daniel Brückner