JavaScript RegExp: replace #hashtag with a link, without touching existing hash links in HTML

Question

I want to replace #hashtag text with something like <a href="http://example.com/foo=hashtag">#hashtag</a>, using JavaScript or jQuery.

Here is what I tried:

<!DOCTYPE html>
<html>
<body>
<button onclick="myFunction()">Try it</button>
<p id="demo">Please visit #Microsoft! #facebook <a href="#link"> Somelink</a>
</p>
<script>
function myFunction() {
    var str = document.getElementById("demo").innerHTML; 
    var txt = str.replace(/#\w+\.?\w+/g,"<a href=\"http://example.com?hashtag=selectedteg\">#Selected</a> ");
    document.getElementById("demo").innerHTML = txt;
}
</script>
</body>
</html>

But this is the result that was returned:

<p id="demo">Please visit <a href="http://example.com?hashtag=selectedteg">#Selected</a> ! <a href="http://example.com?hashtag=selectedteg">#Selected</a>  <a href="&lt;a href=" http:="" example.com?hashtag="selectedteg&quot;">#Selected</a> "&gt; Somelink
</p>

I want the result to look like this:

<p id="demo">Please visit <a href="http://example.com?hashtag=Microsoft">#Microsoft</a> ! <a href="http://example.com?hashtag=facebook">#facebook</a>  <a href="#link">Somelink</a>
</p>

Sigh... please read http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. — bgoldst, Feb 18 '15 at 16:42

score 4 · Answer 1 · edited May 23 '17 at 12:31

Wow! This was a surprisingly difficult problem, although it seems like it should be simple at first glance.

The problem is that, strictly speaking, your requirement demands that only text nodes be processed to transform hashtags into links. Existing HTML should not be touched at all.

A naïve approach (seen in the other answers) would attempt to devise a complex regular expression to dodge the HTML. Although this may appear to work for some cases, even nearly all practical cases, it is absolutely not foolproof. Regular expressions are simply not powerful enough to fully parse HTML; it is just too complex a language. See the excellent and rather famous Stack Overflow answer at RegEx match open tags except XHTML self-contained tags. It can't be done perfectly, and should never be done at all.

Rather, the correct approach is to traverse the HTML tree using a recursive JavaScript function, and replace all target text nodes with processed versions of themselves, which, importantly, may involve the introduction of (non-text) HTML markup inside the text node.

jQuery can be used to accomplish this with minimal complexity, although the task itself necessitates a certain amount of complexity, which, honestly, can't be avoided. As I said, this is a surprisingly difficult problem.

HTML

<button onclick="tryItClick()">Try it</button>
<p id="demo">Please visit #Microsoft! #facebook <a href="#link">Somelink</a>
</p>

JavaScript

if (!window.Node) {
    window.Node = {
        ELEMENT_NODE                :  1,
        ATTRIBUTE_NODE              :  2,
        TEXT_NODE                   :  3,
        CDATA_SECTION_NODE          :  4,
        ENTITY_REFERENCE_NODE       :  5,
        ENTITY_NODE                 :  6,
        PROCESSING_INSTRUCTION_NODE :  7,
        COMMENT_NODE                :  8,
        DOCUMENT_NODE               :  9,
        DOCUMENT_TYPE_NODE          : 10,
        DOCUMENT_FRAGMENT_NODE      : 11,
        NOTATION_NODE               : 12
    };
} // end if

window.linkify = function($textNode) {
    $textNode.replaceWith($textNode.text().replace(/#(\w+\.?\w+)/g,'<a href="http://example.com?hashtag=$1">#$1</a>'));
}; // end linkify()

window.processByNodeType = function($cur, nodeTypes, callback, payload ) {
    if (!nodeTypes.length)
        nodeTypes = [nodeTypes];
    for (var i = 0; i < $cur.length; ++i) {
        if ($.inArray($cur.get(i).nodeType, nodeTypes ) >= 0)
            callback($cur.eq(i), $cur, i, payload );
        processByNodeType($cur.eq(i).contents(), nodeTypes, callback, payload );
    } // end for
}; // end processByNodeType()

window.tryItClick = function(ev) {
    var $top = $('#demo');
    processByNodeType($top, Node.TEXT_NODE, linkify );
}; // end tryItClick()

http://jsfiddle.net/3u6jt988/

It's always good to write general code where possible, to maximize reusability, and often simplicity (although too much generality can lead to excessive complexity; there's a tradeoff there). I wrote processByNodeType() to be a very general function that uses jQuery to traverse a subtree of the HTML node tree, starting from a given top node and working its way down. The purpose of the function is to do one thing and one thing only: to call the given callback() function for all nodes encountered during the traversal that have nodeType equal to one of the whitelisted values given in nodeTypes. That's why I included an enumeration of node type constants at the top of the code; see http://code.stephenmorley.org/javascript/dom-nodetype-constants/.

This function is powerful enough to be called once in response to the click event, passing it the #demo element as the top node, whitelisting only Node.TEXT_NODE nodes, and providing linkify() as the callback.
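
The same traversal idea can be sketched without jQuery. This is only an illustration: walkByNodeType and the mock tree below are hypothetical names, not part of the code above, and the function assumes nothing more than each node having a nodeType and, optionally, a childNodes list (which real DOM nodes also provide).

```javascript
// Hypothetical jQuery-free counterpart of processByNodeType().
// Calls callback(node) for every node in the subtree whose nodeType is whitelisted.
function walkByNodeType(node, nodeTypes, callback) {
    var types = Array.isArray(nodeTypes) ? nodeTypes : [nodeTypes];
    if (types.indexOf(node.nodeType) >= 0)
        callback(node);
    // Copy the child list first, in case a callback on a child modifies the tree.
    var children = Array.prototype.slice.call(node.childNodes || []);
    for (var i = 0; i < children.length; ++i)
        walkByNodeType(children[i], types, callback);
}

// Mock of the #demo paragraph, built from plain objects so it runs outside a browser.
var TEXT_NODE = 3, ELEMENT_NODE = 1;
var demo = {
    nodeType: ELEMENT_NODE, nodeName: 'P', childNodes: [
        { nodeType: TEXT_NODE, nodeValue: 'Please visit #Microsoft! #facebook ' },
        { nodeType: ELEMENT_NODE, nodeName: 'A', childNodes: [
            { nodeType: TEXT_NODE, nodeValue: 'Somelink' }
        ] }
    ]
};

// Collect the content of every text node in the subtree.
var found = [];
walkByNodeType(demo, TEXT_NODE, function(node) { found.push(node.nodeValue); });
console.log(found);
```

In a browser, the same function could be started from document.getElementById('demo') instead of the mock object.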

When linkify() is called, it just takes its first argument, which is the node itself, and does the exact replacement you devised (although capture group backreferences had to be added to properly replace the text with the hashtag). The last piece of the puzzle was to replace the text node with whatever new node structure is needed to effect the replacement, which, if there was indeed a hashtag to replace, would involve the introduction of new HTML structure over the old plain text node. Fortunately, jQuery, whose awesomeness knows no bounds, makes this so incredibly easy that it can be accomplished with a sweet one-liner:

$textNode.replaceWith($textNode.text().replace(/#(\w+\.?\w+)/g,'<a href="http://example.com?hashtag=$1">#$1</a>'));

As you can see, a single call to text() gets the text content of the plain text node, then the replace() function on the string object is called to replace any hashtag with HTML, and then jQuery's replaceWith() method allows us to replace the whole text node with the generated HTML, or leave the original plain text in place if no substitution was performed.
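
One caveat, as far as I can tell: text() returns the decoded text, and replaceWith() treats a string argument as HTML, so a text node that contains something like &lt;b&gt; would come back as real markup. A possible way around that is to build the replacement nodes through the DOM API instead of an HTML string. The sketch below assumes a browser environment for linkifyTextNode(); both function names are hypothetical, and the pattern and URL are the ones used above.

```javascript
// Split plain text into ordinary text segments and hashtag segments.
// Each segment is { tag: <tag name or null>, text: <original text> }.
function splitHashtags(text) {
    var pattern = /#(\w+\.?\w+)/g; // same pattern as in linkify()
    var segments = [];
    var last = 0;
    var m;
    while ((m = pattern.exec(text)) !== null) {
        if (m.index > last)
            segments.push({ tag: null, text: text.slice(last, m.index) });
        segments.push({ tag: m[1], text: m[0] });
        last = pattern.lastIndex;
    }
    if (last < text.length)
        segments.push({ tag: null, text: text.slice(last) });
    return segments;
}

// Replace one DOM text node with text nodes and <a> elements.
// The text is never parsed as HTML, so "<" and "&" in it stay plain text.
function linkifyTextNode(node) {
    var segments = splitHashtags(node.nodeValue);
    var hasTag = segments.some(function(s) { return s.tag !== null; });
    if (!hasTag)
        return; // nothing to replace, leave the node alone
    var fragment = document.createDocumentFragment();
    segments.forEach(function(s) {
        if (s.tag === null) {
            fragment.appendChild(document.createTextNode(s.text));
        } else {
            var a = document.createElement('a');
            a.href = 'http://example.com?hashtag=' + encodeURIComponent(s.tag);
            a.textContent = s.text;
            fragment.appendChild(a);
        }
    });
    node.parentNode.replaceChild(fragment, node);
}
```

It should be possible to plug this into the traversal above with a small wrapper, for example processByNodeType($top, Node.TEXT_NODE, function($n) { linkifyTextNode($n.get(0)); });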


You're absolutely right about the dangers of using regex to capture text nodes. However, in my defense, since the capture is restricted to a p tag in the example, my naive approach could be sufficient. — Gaël Barbin, Feb 18 '15 at 16:58
@Gael, thank you. I agree, your solution is sufficient for the example presented. +1 — bgoldst, Feb 18 '15 at 17:42

Gaël Barbin · Accepted Answer · 2015-02-18T19:10:10.620

You have to capture the hashtag text with parentheses, but you also have to restrict the match to text content, not what is inside the HTML tags. See the comments in the function.

function hashtagReplace() {
    var text = document.getElementById("demo").innerHTML;

    // First capture the text between tags, to avoid matching #link in your example.
    // The text lies between the start of the input or ">" and the end of the input or "<".
    var result = text.replace(/(^.|>)([^<]*)(<|.$)/g, function(match, start, capture, end) {

        // Then capture each hashtag (# followed by the hashtag word) and replace it with the link.
        // $1 inserts the text captured by the parentheses.
        var hashtagsReplaced = (start + capture + end).replace(/\#(\w+)/g, "<a href=\"http://example.com?hashtag=$1\">#$1</a>");

        // Return the processed HTML for this match.
        return hashtagsReplaced;
    });

    // Finally, write the resulting HTML back into the document.
    document.getElementById("demo").innerHTML = result;
}

<!DOCTYPE html>
<html>
<body>
<button onclick="hashtagReplace()">Try it</button>
<p id="demo">#Microsoft Please visit #Microsoft ! #facebook <a href="#link"> Somelink</a>
</p>
</body>
</html>
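
To see where this approach works and where it does not, the same two regular expressions can be applied to plain strings. This is a sketch: replaceHashtagsInHtml is a hypothetical name, and the second input is an assumed variation of the example in which the HTML starts with a tag, so that the `^.` branch consumes the opening "<" and the tag's attributes are treated as text.

```javascript
// Hypothetical pure-string version of hashtagReplace(), using the same two patterns.
function replaceHashtagsInHtml(html) {
    return html.replace(/(^.|>)([^<]*)(<|.$)/g, function(match, start, capture, end) {
        return (start + capture + end).replace(/#(\w+)/g,
            '<a href="http://example.com?hashtag=$1">#$1</a>');
    });
}

// Shape of the original example: text first, existing link afterwards.
var ok = replaceHashtagsInHtml('#Microsoft Please visit #facebook <a href="#link"> Somelink</a>');
console.log(ok);      // the existing href="#link" is left alone

// Assumed variation: the HTML begins with the existing link.
var broken = replaceHashtagsInHtml('<a href="#link">Somelink</a> #facebook');
console.log(broken);  // here the "#link" inside the href is rewritten as well
```

So the function is sufficient for input shaped like the example, but it depends on that shape.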

See bgoldst's answer for a more reliable solution. — Gaël Barbin, Feb 18 '15 at 17:00

Matt Burland · Answer 3 · 2015-02-18T14:58:09.870

You need to capture the group and then use it in the replace. Something like:

var txt = str.replace(/#(\w+\.?\w+)/g,"<a href=\"http://example.com?hashtag=$1\">#$1</a> ");

Putting parentheses around the part you want to capture makes it a capture group, and the captured text is then inserted at the $1 token in the replacement string.
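
A small comparison on a plain string, with and without the parentheses (the variable names are only for illustration):

```javascript
var template = '<a href="http://example.com?hashtag=$1">#$1</a>';

// No parentheses: there is no group 1, so "$1" stays in the output literally.
var withoutGroup = 'Please visit #Microsoft!'.replace(/#\w+\.?\w+/g, template);

// With parentheses: "$1" is replaced by the text the group captured.
var withGroup = 'Please visit #Microsoft!'.replace(/#(\w+\.?\w+)/g, template);

console.log(withoutGroup);
console.log(withGroup);
```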

Of course, your bigger problem is that your regex matches your existing link and tries to substitute in there too, which completely messes things up. This is why it's not a great idea to use a regex to parse HTML. You could work on your regex to exclude existing links, but that quickly becomes a headache. Use DOM manipulation instead.

You could just change your regex to:

/\s(?!href=")#(\w+\.?\w+)/g

This takes advantage of the fact that the #link in your existing link isn't preceded by a space. So you get something like this:

function myFunction() {
  var str = document.getElementById("demo").innerHTML;
  var txt = str.replace(/\s(?!href=")#(\w+\.?\w+)/g, "<a href=\"http://example.com?hashtag=$1\"> #$1</a> ");
  document.getElementById("demo").innerHTML = txt;
}

<button onclick="myFunction()">Try it</button>
<p id="demo">Please visit #Microsoft! #facebook
  <a href="#link"> Somelink</a>
</p>
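
The behaviour of the `/\s(?!href=")#(\w+\.?\w+)/g` pattern can be checked on plain strings. In this sketch linkAfterWhitespace is a hypothetical name, and the two span inputs are assumed examples of inline style markup, with and without a space before the `#`.

```javascript
// Hypothetical helper wrapping the whitespace-based pattern given above.
function linkAfterWhitespace(html) {
    return html.replace(/\s(?!href=")#(\w+\.?\w+)/g,
        '<a href="http://example.com?hashtag=$1"> #$1</a> ');
}

// "#link" follows a quote, not whitespace, so the existing link survives.
var linked = linkAfterWhitespace('Please visit #Microsoft! <a href="#link"> Somelink</a>');

// Assumed markup: a color code preceded by a space is treated as a hashtag too.
var spaced = linkAfterWhitespace('<span style="color: #000">Hello world</span>');

// Assumed markup: without the space the color code is left alone.
var unspaced = linkAfterWhitespace('<span style="color:#000">Hello world</span>');

console.log(linked);
console.log(spaced);
console.log(unspaced);
```

Whether a given `#` is rewritten therefore depends only on the character before it, not on whether it sits in text or in an attribute.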

I'm sorry to mention it, but this also replaces style color codes like #000, i.e.
Hello world
http://jsfiddle.net/nsfn320q/3/ — Asik, Feb 18 '15 at 16:00
@Asik: Not if you don't have a space before the `#`, but yeah, that kind of goes to my point about why doing this with regex is dodgy in the first place. — Matt Burland, Feb 18 '15 at 16:31
