JavaScript Regex Exclude + Include pattern match

Question

I am using JavaScript RegExp for search highlighting on HTML content.

To do that I am using:

data.replace( new RegExp("("+search+")", 'g'), "<b id='searchHighlight'>$1</b>" );

where data is the whole of the HTML content and search is the search string.

When searching for, e.g., h, it would highlight h in words (the, there, etc...) along with instances in tags like "<h1 id="title"> Something </h1>", etc.

I can't go for an alternative approach since I need to highlight the same HTML content with the same style.

I have read solutions like:

var input = "a dog <span class='something'> had a  </span> and a cat";
// Remove anything tag-like
var temp = input.replace(/<.+?>/g, "");
// Perform the search
var matches = new RegExp(exp, "g").exec(temp);

But since I need to highlight the search text in the same HTML content, I can't simply strip out the existing tags. Is there any way to do an include and exclude search in RegExp, so that I could, for example, highlight h in "the" with "t<b id='searchHighlight'>h</b>e"
and not allow "<h1 id="title">Test</h1>" to get corrupted thus: "<<b id='searchHighlight'>h</b>1 id="title">Test</<b id='searchHighlight'>h</b>1>"?

The HTML content is static and looks like this:

    <h1 id="title">Samples</h1>
        <div id="content">
            <div  class="principle">
        <h2 id="heading">           
            PRINCIPLE</h2>


        <p>
            FDA recognizes that samples are an important part of ensuring that the right drugs are provided to the right patients. Under the Prescription Drug Marketing Act (PDMA), a sales representative is permitted to provide prescription drug samples to eligible healthcare professionals (HCPs). In order for BMS to provide this service, representatives must strictly abide by all applicable compliance standards pertaining to the distribution of samples.</p></div>
<h2 id="heading">           
            WHY DOES IT MATTER?</h2>
        <p>
            The Office of Inspector General (OIG) recognizes that samples can have monetary value to HCPs and, when used improperly, may have implications under the Federal False Claims Act and the Federal Anti-kickback Act. To minimize risk of such liability, the OIG requires the clear and conspicuous labeling of individual samples as units that cannot be sold.&nbsp; BMS and its business partners label every sample package to meet this requirement.&nbsp; Additionally, the HCP signature statement acknowledges that the samples will not be sold, billed or provided to family members or friends.</p>
        <h2 id="heading">

            WHO IS YOUR SMaRT PARTNER?</h2>
        <p>
            SMaRT is an acronym for &ldquo;Samples Management and Representatives Together&rdquo;.&nbsp; A SMaRT Partner has a thorough understanding of BMS sample requirements and is available to assist the field with any day-to-day policy or procedure questions related to sample activity. A SMaRT Partner will also:</p>

        <ul>
            <li style="margin-left:22pt;"> Monitor your adherence to BMS&rsquo;s sample requirements.</li>
            <li style="margin-left:22pt;"> Act as a conduit for sharing sample compliance issues and best practices.</li>
            <li style="margin-left:22pt;"> Respond to day-to-day sample accountability questions within two business days of receipt.</li>
        </ul>
        <p>

            Your SMaRT Partner can be reached at 888-475-2328, Option 3.</p>
        <h2 id="heading">

            BMS SAMPLE ACCOUNTABILITY POLICIES &amp; PROCEDURES</h2>
        <p>
            It is the responsibility of each sales representative to read, understand and follow the BMS Field Sample Accountability Procedures, USPSM-SOP-101. The basic expectations are:</p>
        <ul>
            <li style="margin-left:22pt;"> Transmit all sample activity by communicating your tablet to the host server on a <strong>daily</strong> basis.</li>
            <li style="margin-left:22pt;"> Maintain a four to six week inventory of samples rather than excessive, larger inventories that are more difficult to manage and increase your risk of non-compliance.</li>
            <li style="margin-left:22pt;"> Witness all HCP&rsquo;s signatures to confirm request and receipt of samples.</li>
        </ul>
</div>

The contents are all scattered and not in just one tag. So DOM manipulation is not a solution for me.

score 4 · Accepted Answer · answered Mar 13 '13 at 15:18

If you can be sure there are no < or > in a tag's attributes, you could just use

data = data.replace( 
    new RegExp( "(" + search + "(?![^<>]*>))", 'g' ),
        "<b id='searchHighlight'>$1</b>" );

The negative look-ahead (?![^<>]*>) prevents the replacement if > appears before < ahead in the string, as it would if inside a tag.
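
To see the look-ahead at work, here is a minimal sketch. The sample string and the search term are made up for illustration, and it uses class instead of id for the highlight element.

```javascript
// Sample input and search term are assumptions, chosen only for illustration.
var data = '<h1 id="title">the hat</h1>';
var search = 'h';

// Same pattern as above: match `search` only when no `>` is reached
// before the next `<`, i.e. when the match is not inside a tag.
var highlighted = data.replace(
    new RegExp('(' + search + '(?![^<>]*>))', 'g'),
    "<b class='searchHighlight'>$1</b>"
);

console.log(highlighted);
// <h1 id="title">t<b class='searchHighlight'>h</b>e <b class='searchHighlight'>h</b>at</h1>
```

The h in `<h1` and in `</h1>` is left alone because a `>` follows with no `<` in between; the two h's in the text are wrapped.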

This is far from fool-proof, but it may be good enough.

BTW, as you are matching globally, i.e. making more than one replacement, id='searchHighlight' should probably be class='searchHighlight'.

And you need to be careful that search does not contain any regex special characters.
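
One common way to deal with that is to escape the search string before building the RegExp. The sketch below assumes a helper named escapeRegExp (a hypothetical name, not a built-in) and only covers regex syntax; it does nothing about HTML entities in the content.

```javascript
// Hypothetical helper: put a backslash in front of every character
// that has a special meaning in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var search = 'a.b'; // example search string containing a special character
var pattern = new RegExp('(' + escapeRegExp(search) + '(?![^<>]*>))', 'g');

// Square brackets stand in for the highlight markup to keep the output short.
var result = 'a.b and axb'.replace(pattern, '[$1]');

console.log(result); // [a.b] and axb
```

Without the escaping, the dot would match any character and axb would be highlighted as well.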

MikeM

I think this would only check to see if the `search` string was *immediately followed by* either a `<` or a `>`... At least that's how I understand the [?! quantifier](http://www.w3schools.com/jsref/jsref_regexp_nfollow_not.asp). Good point about `id` and `class` though. – guypursey Mar 13 '13 at 15:38
@guypursey. That is incorrect: the `(?!)` is a negative look-ahead, not a quantifier. The `[^<>]*` will match any characters ahead that are not `<` or `>`. – MikeM Mar 13 '13 at 15:41
Thank you Mike. You saved my day. I don't have enough reputation to upvote your post. I will once I get that :) thanks again – leninmon Mar 14 '13 at 06:01

score 1 · Answer 2 · edited May 23 '17 at 12:07

you're probably aware of the fact that you try to employ the wrong tool for the job, so this is just for the record (in case you're not, you may find this insightful).

you might (most certainly will?) encounter one fundamental problem on html attributes with basically arbitrary textual content, namely title (the tooltip attribute) and data-... (generic user-defined attributes to hold arbitrary data by design) - whatever you find in the textual part of your html code, you could find there too, the replacement on which will deface balloon help and/or wreck some application logic. also note that any character of the textual content may be encoded as named or numerical entity (e.g. & -> &amp;, &#38;, &#x26;), which can be handled in principle but will complicate the dynamic regex (vastly in case your variable search will hold straight text).

having said all this, you MIGHT get along with data.replace( new RegExp("([>]?)[^><]*("+search+")[^><]*([<]?)", 'g'), "<b id='searchHighlight'>$1$2$3</b>" ); unless search results to be highlighted may contain characters that have semantics in regex specifications, like .+*|([{}])\, perhaps -; these you'd have to escape properly.

in summary: revise your design to save you from LOTS of trouble.

btw, why wouldn't you opt for dom traversal? you don't need to know about the actual html tags present to do that.

answered Mar 13 '13 at 14:49

collapsar


Thank you collapsar for the tip. I liked the idea of DOM traversal. But the problem is that in order to set a style, I will have to get either the innerHTML or the innerText and alter it. Altering innerHTML will give the same result (altering HTML tags), and altering innerText will show the highlighting tags as is (e.g. t<b>h</b>e). – leninmon Mar 14 '13 at 06:23
you don't have to resort to innerHTML at all. the dom stores text in nodes of its own type which will not contain any html tags. so you'd traverse the dom, collect the contents of adjacent text nodes into a string, perform the substitution, split the resulting string at tag start and end positions (here the crucial difference to the original design applies: you know that the only tags present are those you have inserted during the substitution) and reinsert the parts as a sequence of text nodes and <b>-nodes replacing the original nodes. all functionality is available in the dom ... – collapsar Mar 14 '13 at 08:16
... of course, as you would create dom nodes for the highlighting anyway, your replacement strings can be simple texts marking beginning and end of a highlighting section. on hitting highlight_start you create a new dom node with the next element of the array as a child text node and skip the highlight_end mark, else you just create a text node. – collapsar Mar 14 '13 at 08:27
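
For the record, a rough sketch of the text-node idea from these comments. It is a simplified variant, not exactly the procedure described above: each text node is handled on its own, and a plain substring search replaces the marker strings, so a match that spans an element boundary is not found. The function names splitOnMatches and highlightTextNodes are made up for this example.

```javascript
// Hypothetical pure helper: split `text` into plain and matching segments.
function splitOnMatches(text, search) {
    var parts = [];
    var from = 0;
    var at = search ? text.indexOf(search, from) : -1;
    while (at !== -1) {
        if (at > from) {
            parts.push({ text: text.slice(from, at), hit: false });
        }
        parts.push({ text: search, hit: true });
        from = at + search.length;
        at = text.indexOf(search, from);
    }
    if (from < text.length) {
        parts.push({ text: text.slice(from), hit: false });
    }
    return parts;
}

// Browser-only part: replace each text node below `root` that contains
// a match with a sequence of text nodes and <b> elements.
function highlightTextNodes(root, search) {
    var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    var textNodes = [];
    while (walker.nextNode()) {
        textNodes.push(walker.currentNode); // collect first, the tree is changed below
    }
    textNodes.forEach(function (node) {
        var parts = splitOnMatches(node.nodeValue, search);
        var hasHit = parts.some(function (part) { return part.hit; });
        if (!hasHit) {
            return; // nothing to highlight in this node
        }
        var fragment = document.createDocumentFragment();
        parts.forEach(function (part) {
            var textNode = document.createTextNode(part.text);
            if (part.hit) {
                var b = document.createElement('b');
                b.className = 'searchHighlight';
                b.appendChild(textNode);
                fragment.appendChild(b);
            } else {
                fragment.appendChild(textNode);
            }
        });
        node.parentNode.replaceChild(fragment, node);
    });
}
```

In a browser this would be called as highlightTextNodes(document.body, 'h'). Tags and attributes are never touched because only text nodes are visited, though text inside script or style elements would need to be skipped separately.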

guypursey · Answer 3 · 2013-03-13T16:24:17.877

This isn't a pure RegExp solution but, if you can't traverse the DOM, then string manipulation with functional replaces and loops like this could work for you.

Declare the variables you need and fetch the innerHTML of your document body.
Look through the data extracting any tags and saving them in an array for now. Leave a placeholder so you know where to put them back later.
With all the tags replaced with temporary placeholders in your string, you can then replace the characters you need to, using your original code but assigning the result back to data.
Then you would need to restore the tags by reversing the earlier process.
Assign the new data as the innerHTML of your document body.

This is the process in action.

Here is the code:

var data = document.body.innerHTML, // get the DOM as a string
    tagarray = [], // a place to temporarily store all your tags
    tagmatch = /<[^>]+>/g, // for matching tags
    tagplaceholder = '<>', // could be anything but should not match the RegExp above, and not be the same as the search string below
    search = 'h'; // for example; but this could be set dynamically

while (tagmatch.test(data)) {
    data = data.replace(tagmatch, function (str) {
        tagarray.push(str); // store each matched tag in your array
        return tagplaceholder; // whatever your placeholder should be
    });
}

data = data.replace( new RegExp("("+search+")", 'g'), "<b id='searchHighlight'>$1</b>" ); // now search and replace the string of your choice

while (new RegExp(tagplaceholder, 'g').test(data)) {
    data = data.replace(tagplaceholder, function (str) {
        return tagarray.shift(); // take the next saved tag (shift takes no arguments) and put it back in place of the placeholder
    });
}

document.body.innerHTML = data; // assign the changed `data` string to the body

Obviously if you can put this all in a function of its own, so much the better, as you don't really want global variables like the above hanging around.
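
A possible shape for such a function, as a sketch only. The name highlightOutsideTags is made up, it works on a string rather than on document.body, and it assumes that search contains no regex special characters and no < or >, and that the text itself never contains a literal <>.

```javascript
// Hypothetical wrapper around the placeholder technique described above.
function highlightOutsideTags(html, search) {
    var tags = [];
    var placeholder = '<>'; // never matches the tag pattern, which needs a character between < and >

    // Step 1: swap every tag for the placeholder, remembering the tags in order.
    var stripped = html.replace(/<[^>]+>/g, function (tag) {
        tags.push(tag);
        return placeholder;
    });

    // Step 2: highlight the search string in what is left.
    var marked = stripped.replace(
        new RegExp('(' + search + ')', 'g'),
        "<b class='searchHighlight'>$1</b>"
    );

    // Step 3: put the saved tags back, first in, first out.
    return marked.replace(/<>/g, function () {
        return tags.shift();
    });
}

console.log(highlightOutsideTags('<h1 id="title">the hat</h1>', 'h'));
// <h1 id="title">t<b class='searchHighlight'>h</b>e <b class='searchHighlight'>h</b>at</h1>
```

Because both replace calls use the g flag, no while loops are needed, and nothing leaks into the global scope. In the page it could be used as document.body.innerHTML = highlightOutsideTags(document.body.innerHTML, search).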
