Regex to search html return, but not actual html jQuery

Question

I'm making a highlighting plugin for a client to find things in a page, and I decided to test it with a help viewer I'm still building, but I'm having an issue that will (probably) require some regex.

I do not want to parse HTML, and I'm totally open to doing this differently; this just seems like the best/right way.

http://oscargodson.com/labs/help-viewer

http://oscargodson.com/labs/help-viewer/js/jquery.jhighlight.js

Type something in the search box. OK, now refresh the page and type something like class, class=" or <a. You'll notice it searches the actual HTML (as expected). How can I search only the text?

If I use .text(), it'll vaporize all the HTML and what I get back is just one big blob of text, but I still want the HTML so I don't lose formatting, links, images, etc. I want this to work like CMD/CTRL+F.

You'd use this plugin like:

$('article').jhighlight({find:'class'});

To remove them:

.jhighlight('remove')

==UPDATE==

While Mike Samuel's idea below does in fact work, it's a tad heavy for this plugin. It's mainly for a client looking to erase bad words and/or MS Word characters during a "publishing" process of a form. I'm looking for a more lightweight fix, any ideas?

score 2 · Answer 1 · answered Mar 23 '11 at 10:21

You really don't want to use eval, mess with innerHTML or parse the markup "manually". The best way, in my opinion, is to deal with text nodes directly and keep a cache of the original html to erase the highlights. Quick rewrite, with comments:

(function($){
  $.fn.jhighlight = function(opt) {

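    // 'remove' arrives as a string, so only merge real option objects;
    // extend into a fresh object so the shared defaults are not mutated
    var options = $.extend({}, $.fn.jhighlight.defaults, typeof opt === 'object' ? opt : {})
      , txtProp = this[0].textContent ? 'textContent' : 'innerText';

    // nothing to search for (but still let 'remove' through)
    if (opt !== 'remove' && $.trim(options.find).length < 1) return this;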

    return this.each(function(){

      var self = $(this);

      // use a cache to clear the highlights
      if (!self.data('htmlCache'))
        self.data('htmlCache', self.html());

      if(opt === 'remove'){
        return self.html( self.data('htmlCache') );
      }

     // create Tree Walker
     // https://developer.mozilla.org/en/DOM/treeWalker
     var walker = document.createTreeWalker(
          this, // walk only on target element
          NodeFilter.SHOW_TEXT,
          null,
          false
      );

      var node
        , matches
        , flags = 'g' + (!options.caseSensitive ? 'i' : '')
        , exp = new RegExp('('+options.find+')', flags) // capturing
        , expSplit = new RegExp(options.find, flags) // no capturing
        , highlights = [];

      // walk this wayy
      // and save matched nodes for later
      while(node = walker.nextNode()){
        if (matches = node.nodeValue.match(exp)){
          highlights.push([node, matches]);
        }
      }

      // must replace stuff after the walker is finished
      // otherwise replacing a node will halt the walker
      for(var nn=0,hln=highlights.length; nn<hln; nn++){

        var node = highlights[nn][0]
          , matches = highlights[nn][1]
          , parts = node.nodeValue.split(expSplit) // split on matches
          , frag = document.createDocumentFragment(); // temporary holder

        // add text + highlighted parts in between
        // like a .join() but with elements :)
        for(var i=0,ln=parts.length; i<ln; i++){

          // non-highlighted text
          if (parts[i].length)
            frag.appendChild(document.createTextNode(parts[i]));

          // highlighted text
          // skip last iteration
          if (i < ln-1){
            var h = document.createElement('span');
            h.className = options.className;
            h[txtProp] = matches[i];
            frag.appendChild(h);
          }
        }
        // replace the original text node
        node.parentNode.replaceChild(frag, node);
      };

    });
  };

 $.fn.jhighlight.defaults = {
    find:'',
    className:'jhighlight',
    color:'#FFF77B',
    caseSensitive:false,
    wrappingTag:'span'
 };

})(jQuery);
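
One caveat worth illustrating: the plugin above passes options.find straight into new RegExp, so a search such as class="(main)" is read as a pattern rather than as literal text. Below is a minimal sketch of escaping the term first, assuming literal matching is what you want. escapeRegExp is a hypothetical helper, not part of the plugin or of jQuery, and the sample string is made up for illustration.

```javascript
// Hypothetical helper (not part of the plugin or jQuery): escape regex
// metacharacters so the search term is matched literally.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var find = 'class="(main)"';
var exp = new RegExp(escapeRegExp(find), 'gi');

// The same split/match pairing the plugin uses on each text node.
var sample = 'Use class="(main)" on the wrapper.';
var parts = sample.split(exp);   // text around the matches
var matches = sample.match(exp); // the matched text itself
```

Skip the escaping if you want regular expression searches to keep working as search input.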

If you're doing any other manipulation on the page, you might want to replace the caching with another clean-up mechanism, since restoring the cached HTML discards those changes. That's not trivial, though.

You can see the code working here: http://jsbin.com/anace5/2/

You also need to add display:block to your new HTML elements; the layout is broken in a few browsers.

hmm. I don't remember why I put that regex with a capturing group in. You can probably keep just the plain one. — Ricardo Tomasi, Mar 23 '11 at 10:25
Wow, huge props for this. You for sure deserve the 50 points. Now don't go on spending it all in one place ;) — Oscar Godson, Mar 24 '11 at 21:00
thanks. I just noticed a few more things: the txtProp test might fail if the element is empty, and regular expression searches work (i.e. 10 character words: \b\w{10}\b) :D — Ricardo Tomasi, Mar 24 '11 at 21:55
What are your thoughts on this regex, though: `('+settings.find+')(?![^><]+>)`? Someone at work suggested it, and my code works with it without changing anything else. Could you explain why eval() is bad? Your code also works; I'm just looking for what's best and why :) — Oscar Godson, Mar 24 '11 at 22:01
That regex is matching tags. You don't need to care about tags, only text. Your current code ends up breaking the HTML because it replaces things where it shouldn't. Parsing HTML is no easy task, we have the DOM to manipulate it easily. — Ricardo Tomasi, Mar 25 '11 at 17:30
Well, the regex in the last comment I posted doesn't break the HTML, as it only finds content outside of a tag. So in `Take an online class or a class at the school!` it won't match `class=""`. Is there a reason not to use this solution? — Oscar Godson, Mar 25 '11 at 20:23
It won't match "one two" in `
one two
` either. You're still parsing HTML when the browser has already done it for you. I'm rooting for clean/clear/maintainable code, but you should choose whatever works best for you :) — Ricardo Tomasi, Mar 26 '11 at 07:47
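
To make the trade-off in the comments above concrete, here is a small sketch of the lookahead regex run against plain strings. buildTextOnlyPattern is a hypothetical name, and the sample markup (including the mark wrapper) is invented for illustration; it is not taken from the plugin.

```javascript
// Hypothetical helper: the lookahead pattern suggested in the comments.
// A match is rejected when one or more non-bracket characters followed by
// '>' come right after it, which is what the inside of a tag looks like.
function buildTextOnlyPattern(find) {
  return new RegExp('(' + find + ')(?![^><]+>)', 'gi');
}

var html = 'Take a <a class="course" href="#">class</a> at the school!';
var marked = html.replace(buildTextOnlyPattern('class'), '<mark>$1</mark>');
// only the link text is wrapped; the class attribute is left alone

// A phrase interrupted by a tag is not found at all.
var spanning = '<i>one</i> two'.match(buildTextOnlyPattern('one two'));

// A tag name followed directly by '>' slips past the lookahead,
// because [^><]+ needs at least one character before the '>'.
var tagNames = '<b>bold</b>'.match(buildTextOnlyPattern('b'));
// three matches: both tag names plus the 'b' in "bold"
```

The last case is the kind of edge that working on text nodes avoids, since the tag names never appear in nodeValue.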

score 0 · Answer 2 · answered Mar 16 '11 at 21:13


In the JavaScript code prettifier, I had this problem. I wanted to search the text but preserve tags.

What I did was start with the HTML and decompose it into two parts:

The text content
Pairs of (index into text content where a tag occurs, the tag content)

So given

Lorem <b>ipsum</b>

I end up with

text = 'Lorem ipsum'
tags = [6, '<b>', 11, '</b>']

which allows me to search on the text, and then based on the result start and end indices, produce HTML including only the tags (and only balanced tags) in that range.
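
A minimal sketch of that decomposition, for illustration only. decompose is a hypothetical helper and not the prettifier's actual code; it assumes simple, well-formed markup with no '<' or '>' inside attribute values, comments, or scripts, and it does not decode entities. In this sketch offsets are insertion points into the text, so a closing tag sits at the index just past the last character it wraps.

```javascript
// Hypothetical helper (not the prettifier's real implementation):
// split simple markup into its plain text and a flat list of
// (offset into text, tag) pairs.
function decompose(html) {
  var text = '';
  var tags = [];
  var tagPattern = /<[^>]*>/g;
  var last = 0;
  var m;
  while ((m = tagPattern.exec(html)) !== null) {
    text += html.slice(last, m.index); // text before this tag
    tags.push(text.length, m[0]);      // where the tag sits in the text
    last = tagPattern.lastIndex;
  }
  text += html.slice(last);            // trailing text after the last tag
  return { text: text, tags: tags };
}

var result = decompose('Lorem <b>ipsum</b>');
// result.text is the searchable string; result.tags records where to put
// the tags back once a match range is known
```

Searching then runs against result.text only, and the tags list is consulted when rebuilding HTML for the matched range.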

Mike Samuel


Great, thanks. So, just trying to get my head around this: if this were a live search, I'd have to grab the HTML and text on each keystroke? Or I could have a timeout function waiting for a slight pause, but that seems CPU intensive. Or do you suggest parsing this all on page load? – Oscar Godson Mar 16 '11 at 21:27
@Oscar, This structure can be parsed once and cached. Think of it as part of your search index. For the pattern that you're matching, you can build that as the user types, and apply it to the text. How often you do that depends on the size of text, since the cost of matching a simple regex against a text string is O(text.length). – Mike Samuel Mar 16 '11 at 21:36
Yeah, this would be a perfect use case for some localStorage :) – Oscar Godson Mar 16 '11 at 21:37

score 0 · Answer 3 · edited May 23 '17 at 09:58


Have a look here: getElementsByTagName() equivalent for textNodes. You can probably adapt one of the proposed solutions to your needs (i.e. iterate over all text nodes, replacing the words as you go - this won't work in cases such as <tag>wo</tag>rd but it's better than nothing, I guess).
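
A minimal sketch of the "iterate over all text nodes" idea, assuming nothing beyond the standard nodeType and childNodes properties. collectTextNodes is a hypothetical name, and the mockTree object below only stands in for a real element so the snippet can run outside a browser.

```javascript
// Hypothetical helper: gather every text node under root, in document order.
function collectTextNodes(root) {
  var found = [];
  (function visit(node) {
    if (node.nodeType === 3) { // 3 is Node.TEXT_NODE
      found.push(node);
      return;
    }
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
      visit(children[i]);
    }
  })(root);
  return found;
}

// Stand-in for <p><b>wo</b>rd</p>; in a browser, pass a real element instead.
var mockTree = {
  nodeType: 1,
  childNodes: [
    { nodeType: 1, childNodes: [{ nodeType: 3, nodeValue: 'wo' }] },
    { nodeType: 3, nodeValue: 'rd' }
  ]
};
var textNodes = collectTextNodes(mockTree);
// two separate text nodes, 'wo' and 'rd'; neither contains "word" on its
// own, which is the limitation mentioned above
```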

answered Mar 21 '11 at 22:06

CAFxX


Do you have an example of how I could implement this? – Oscar Godson Mar 22 '11 at 22:26
@Oscar, use one of the methods outlined in the replies to the question I linked to iterate over all TextNodes, and use `jhighlight()` on each of them. – CAFxX Mar 23 '11 at 06:53

mnelson · Answer 4 · 2011-03-22T06:10:11.383

0

I believe you could just do:

$('#article :not(:has(*))').jhighlight({find : 'class'});

Since it grabs only the leaf elements in the article (elements with no child elements), it would require valid XHTML. That is, it would only match "link" in the following example:

<p>This is some paragraph content with a <a href="#">link</a></p>
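
To see why only "link" is reached, here is a rough sketch of what a leaf-only selection keeps. leafElements is a hypothetical helper that mimics the idea behind :not(:has(*)) on a mock tree; it is not how jQuery implements the selector, and the mock objects only stand in for real DOM nodes.

```javascript
// Hypothetical helper: collect descendant elements that have no element
// children, which is roughly what ':not(:has(*))' selects.
function leafElements(root) {
  var leaves = [];
  (function visit(node) {
    var kids = [];
    var children = node.childNodes || [];
    for (var i = 0; i < children.length; i++) {
      if (children[i].nodeType === 1) kids.push(children[i]); // elements only
    }
    if (kids.length === 0 && node !== root) leaves.push(node);
    for (var j = 0; j < kids.length; j++) visit(kids[j]);
  })(root);
  return leaves;
}

// Stand-in for the paragraph above, wrapped in an article element.
var mockArticle = {
  nodeType: 1, tagName: 'ARTICLE',
  childNodes: [{
    nodeType: 1, tagName: 'P',
    childNodes: [
      { nodeType: 3, nodeValue: 'This is some paragraph content with a ' },
      { nodeType: 1, tagName: 'A', childNodes: [{ nodeType: 3, nodeValue: 'link' }] }
    ]
  }]
};
var leaves = leafElements(mockArticle);
// only the <a> is a leaf; the paragraph's own text is never searched
```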

DOM traversal and selector matching could slow things down a bit, so it might be good to cache the selection:

var article_nodes = article_nodes || $('#article :not(:has(*))');
article_nodes.jhighlight({find : 'class'});

edited Mar 22 '11 at 06:10

answered Mar 22 '11 at 05:20

mnelson


That seems to break normal finds, though. Make that your selector and then type "example": only the "e" gets highlighted, and then it stops highlighting altogether. :\ Any ideas why that'd be? – Oscar Godson Mar 22 '11 at 21:16
Can you check what the output is for: article_nodes.map(function(i,e){ return $(e).html(); }).get(); – mnelson Mar 22 '11 at 23:13

avrelian · Answer 5 · 2011-03-23T08:33:39.693

0

Maybe something like this could be helpful:

>+[^<]*?(s(<[\s\S]*?>)?e(<[\s\S]*?>)?e)[^>]*?<+

The first part, >+[^<]*?, finds the > of the last preceding tag.

The third part, [^>]*?<+, finds the < of the first subsequent tag.

In the middle we have (<[\s\S]*?>)? between the characters of our search phrase (in this case, "see").

After running the regular expression, you can use the captured middle part to highlight the search phrase for the user.
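
One way such a pattern could be assembled from the search phrase without eval() is sketched below. Both function names are hypothetical, the per-character escaping is an added assumption (the pattern above does not escape anything), and the sample markup is invented for illustration.

```javascript
// Hypothetical helper: escape regex metacharacters in a string.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: allow one optional tag between each pair of
// characters of the search phrase, following the pattern shown above.
function buildSpanningPattern(find) {
  var optionalTag = '(<[\\s\\S]*?>)?';
  var chars = find.split('').map(escapeRegExp);
  return new RegExp('>+[^<]*?(' + chars.join(optionalTag) + ')[^>]*?<+');
}

var pattern = buildSpanningPattern('see');
var match = pattern.exec('<p>I s<b>e</b>e you</p>');
// match[1] holds the middle part, with any tags that interrupt the phrase
```

Passing the assembled string to new RegExp gives the same result as building a literal and evaluating it. Note that each optional tag adds its own capture group after the first one.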

edited Mar 23 '11 at 08:33

answered Mar 22 '11 at 08:42

avrelian


Seems like it could work, but how would I build that regex each time? Are you saying to split the search string and then loop through and build this regex then eval() it? Or, do you know of a better way? – Oscar Godson Mar 22 '11 at 21:23
This doesn't appear to be working for me. For example, when I do a str.match() on "ex" I get: https://skitch.com/oscargodson/rsdy8/developer-tools-http-localhost-8888-help-viewer-search-ex And this is the regex being created dynamically for this search: (>+[^<]*?)(e(<[\s\S]*?>)?x)([^>]*?<+) Screenshot of what happens in the plugin: https://skitch.com/oscargodson/rsdy4/help-viewer – Oscar Godson Mar 22 '11 at 22:23
Oscar, I have modified my regex. Could you slightly change the script jQuery.jhighlights: remove `(` and `)` from `eval('/('+settings.find+')/g'`. Now `$1` would suffice. – avrelian Mar 23 '11 at 08:40
