How to properly bold search terms from Twitter, strange regex case in JS

Question

I'm retrieving tweets from Twitter with the Twitter API and displaying them in my own client.

However, I'm having some difficulty properly highlighting the right search terms. I want an effect like the following:

[Image: search terms are properly bolded]

The way I'm trying to do this in JS is with a function called highlightSearchTerms(), which takes the text of the tweet and an array of keywords to bold as arguments. It returns the text of the fixed tweet. I'm bolding keywords by wrapping them in a <span> that has the class .search-term.

I'm having a lot of problems, which include:

Running a simple replace doesn't preserve case
There is a lot of conflict with the keyword being in href tags
If I try to do a for loop with a replace, I don't know how to only modify search terms that aren't in an href, and that I haven't already wrapped with the span above

An example tweet I want to be able to handle:

Input:

This is a keyword. This is a <a href="http://search.twitter.com/q=%23keyword">
#keyword</a> with a hashtag. This is a link with kEyWoRd: 
<a href="http://thiskeyword.com">http://thiskeyword.com</a>.

Expected Output:

This is a 
<span class="search-term">keyword</span>
. This is a <a href="http://search.twitter.com/q=%23keyword"> #
<span class="search-term">keyword</span>
</a> with a hashtag. This is a link with 
<span class="search-term">kEyWoRd</span>
:<a href="http://thiskeyword.com">http://this
<span class="search-term">keyword.com</span>
</a>.

I've tried many things, but unfortunately I can't quite find out the right way to tackle the problem. Any advice at all would be greatly appreciated.

Here is my code, which works for some cases but ultimately doesn't do what I want. It fails when the keyword is in the latter half of the link (e.g. http://twitter.com/this_keyword). Sometimes it also strangely highlights the 2 characters before a keyword. I doubt the best solution would resemble my code too much.

function _highlightSearchTerms(text, keywords){

    for (var i=0;i<keywords.length;i++) {

    // create regex to find all instances of the keyword, catch the links that potentially come before so we can filter them out in the next step
    var searchString = new RegExp("[http://twitter.com/||q=%23]*"+keywords[i], "ig");

    // create an array of all the matched keyword terms in the tweet, we can't simply run a replace all as we need them to retain their initial case
    var keywordOccurencesInitial = text.match(searchString);

    // create an array of the keyword occurences we want to actually use, I'm sure there's a better way to create this array but rather than try to optimize, I just worked with code I know should work because my problem isn't centered around this block
    var keywordOccurences = [];
    if (keywordOccurencesInitial != null) {
        for(var i3=0;i3<keywordOccurencesInitial.length;i3++){
            if (keywordOccurencesInitial[i3].indexOf("http://twitter.com/") > -1 || keywordOccurencesInitial[i3].indexOf("q=%23") > -1) 
                continue;
            else
                keywordOccurences.push(keywordOccurencesInitial[i3]);
        }
    }

    // replace our matches with search term
    // the regex should ensure to NOT catch terms we've already wrapped in the span
    // i took the negative lookbehind workaround from http://stackoverflow.com/a/642746/1610101
    if (keywordOccurences != null) {
        for(var i2=0;i2<keywordOccurences.length;i2++){
            var searchString2 = new RegExp("(q=%23||http://twitter.com/||<span class='search-term'>)?"+keywordOccurences[i2].trim(), "g"); // don't replace what we've alrdy replaced
            text = text.replace(searchString2, 
                function($0,$1){ 
                    return $1?$0:"<span class='search-term'>"+keywordOccurences[i2].trim()+"</span>";
                });
        }
    }

    } // end of the loop over keywords

    return text;
}

@DavidThomas OK. I apologize. I'm formatting and posting it now. — Patrick Monaghan, Feb 26 '15 at 19:08

Regular Jo · Accepted Answer · 2015-03-04T20:31:33.970

Here's something you can probably work with:

var getv = document.getElementById('tekt').value;
var keywords = "keyword,big elephant"; // comma delimited keyword list
var rekeywords = "(" + keywords.replace(/\, ?/ig,"|") + ")"; // wraps keywords in ( and ), and changes , to a pipe (character for regex alternation)

var keyrex = new RegExp("(#?\\b" + rekeywords + "\\b)(?=[^>]*?<[^>]*>|(?![^>]*>))","igm")

alert(keyrex);
document.getElementById('tekt').value =  document.getElementById('tekt').value.replace(keyrex,"<span class=\"search-term\">$1</span>");
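
The same idea can be wrapped in a function with the signature from the question (tweet text plus an array of keywords). This is only a minimal sketch, not a tested drop-in: `escapeRegExp` is a hypothetical helper added here so that keywords containing regex metacharacters (such as the dot in "node.js") are matched literally, and the function assumes the tweet HTML is well formed. The lookahead is what skips matches inside tags: a match is accepted only if the next angle bracket after it is a `<`, or if no `>` remains in the text. Case is preserved because the replacement reuses the matched text (`$1`) rather than the keyword itself.

```javascript
// Hypothetical helper (not part of the original answer): escape regex
// metacharacters so a keyword such as "node.js" is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Sketch of highlightSearchTerms(text, keywords), where keywords is an array.
function highlightSearchTerms(text, keywords) {
    if (!keywords.length) {
        return text; // nothing to highlight
    }
    var alternation = keywords.map(escapeRegExp).join("|");
    var keyrex = new RegExp(
        "(#?\\b(?:" + alternation + ")\\b)" + // optional hashtag, then a whole-word keyword
        "(?=[^>]*?<[^>]*>|(?![^>]*>))",       // only when the match is not inside a tag
        "ig");
    // $1 is the text exactly as it appeared in the tweet, so "kEyWoRd" keeps its case.
    return text.replace(keyrex, '<span class="search-term">$1</span>');
}

var sample = 'This is a keyword. This is a ' +
    '<a href="http://search.twitter.com/q=%23keyword">#keyword</a> with a hashtag.';
console.log(highlightSearchTerms(sample, ["keyword"]));
```

Note that with this pattern the # of a hashtag ends up inside the span, and the keyword inside thiskeyword.com is left alone because of the word boundaries.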

And here is a variation that attempts to deal with word forms. If a keyword ends in a common suffix (es, ing, ed, d, s, e), the suffix is chopped off, and the pattern then accepts the stem followed by any one of those suffixes before the closing word boundary. It's not perfect: for instance, the past tense of ride is rode, which this will not match. Accounting for irregular forms like that with regex is nigh-impossible without opening yourself up to tons of false positives.

var getv = document.getElementById('tekt').value;
var keywords = "keywords,big elephant";
var rekeywords = "(" + keywords.replace(/(es|ing|ed|d|s|e)?\b(\s*,\s*|$)/ig,"(es|ing|ed|d|s|e)?$2").replace(/,/g,"|") + ")";

var keyrex = new RegExp("(#?\\b" + rekeywords + "\\b)(?=[^>]*?<[^>]*>|(?![^>]*>))","igm")

console.log(keyrex);

document.getElementById('tekt').value =  document.getElementById('tekt').value.replace(keyrex,"<span class=\"search-term\">$1</span>");
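
If the keywords arrive as an array rather than a comma-delimited string, the suffix handling can be done per keyword. The following is a rough sketch under that assumption; `stemPattern` and `highlightWordForms` are hypothetical names, and the keywords are assumed to be plain words with no regex metacharacters.

```javascript
// Optional group of common suffixes, as in the variation above.
var SUFFIXES = "(?:es|ing|ed|d|s|e)?";

// Hypothetical helper: drop one trailing common suffix from the keyword,
// then allow any of the suffixes (or none) in its place.
function stemPattern(keyword) {
    return keyword.replace(/(?:es|ing|ed|d|s|e)$/i, "") + SUFFIXES;
}

// Sketch: highlight each keyword and its simple word forms outside of tags.
function highlightWordForms(text, keywords) {
    var keyrex = new RegExp(
        "(#?\\b(?:" + keywords.map(stemPattern).join("|") + ")\\b)" +
        "(?=[^>]*?<[^>]*>|(?![^>]*>))", // skip matches inside tags
        "ig");
    return text.replace(keyrex, '<span class="search-term">$1</span>');
}

console.log(stemPattern("keywords"));
console.log(highlightWordForms("He rides and is riding, but rode home.", ["ride"]));
```

Here "rides" and "riding" are wrapped while "rode" is not, which is the limitation described above.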

Edit

This is just about perfect. Do you know how to slightly modify it so the keyword in thiskeyword.com would also be highlighted?

Change this line

var keyrex = new RegExp("(#?\\b" + rekeywords + "\\b)(?=[^>]*?<[^>]*>|(?![^>]*>))","igm")

to (All I did was remove both \\b's):

var keyrex = new RegExp("(#?" + rekeywords + ")(?=[^>]*?<[^>]*>|(?![^>]*>))","igm")

But be warned, you'll have problems like smiles ending up with only the mile inside it highlighted (if a user searches for mile), and there's nothing regex can do about that. Regex's definition of a word is a run of letters, digits, and underscores; it has no dictionary to check.
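
To see the trade-off side by side, here is a small sketch that builds both variants. The `highlight` function and its `wholeWords` flag are hypothetical names used only for this comparison, and the keyword is assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper for comparing the two variants:
// wholeWords = true keeps the \b anchors, false removes them.
function highlight(text, keyword, wholeWords) {
    var b = wholeWords ? "\\b" : "";
    var keyrex = new RegExp(
        "(#?" + b + keyword + b + ")(?=[^>]*?<[^>]*>|(?![^>]*>))", "ig");
    return text.replace(keyrex, '<span class="search-term">$1</span>');
}

var link = '<a href="http://thiskeyword.com">http://thiskeyword.com</a>';

// With \b: unchanged, because "keyword" is not a whole word in "thiskeyword".
console.log(highlight(link, "keyword", true));

// Without \b: the visible link text is wrapped, the href attribute is left alone.
console.log(highlight(link, "keyword", false));

// Without \b: the "mile" inside "smiles" is wrapped as well.
console.log(highlight("She smiles.", "mile", false));
```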

This is just about perfect. Do you know how to slightly modify it so the keyword in http://thiskeyword.com would also be highlighted? — Patrick Monaghan, Mar 04 '15 at 19:52
