2

How do I find every word on a page beginning with http:// and wrap tags around it?

Can I use something like regex perhaps?

Donald Duck
Tim

3 Answers

5

I disagree heavily that jQuery can't be much use in finding a solution here. Granted, you have to get down and dirty with some of the text node properties, but putting the DOM back together again after you split your matched node can be made a wee bit easier using the jQuery library.

The following code is documented inline to explain the action taken. I've written it as a jQuery plugin in case you just want to take this and move it around elsewhere. This way you can scope which elements you want to convert URLs for or you can simply use the $("body") selector.

(function($) {
    $.fn.anchorTextUrls = function() {
        // Test a text node's contents for URLs and split and rebuild it with an anchor
        var testAndTag = function(el) {
            // Test for URLs along whitespace and punctuation boundaries (don't look too hard or you will be consumed)
            var m = el.nodeValue.match(/(https?:\/\/.*?)[.!?;,]?(\s+|"|$)/);

            // If we've found a valid URL, m[1] contains the URL
            if (m) {
                // Clone the text node to hold the "tail end" of the split node
                var tail = $(el).clone()[0];

                // Substring the nodeValue attribute of the text nodes based on the match boundaries
                el.nodeValue = el.nodeValue.substring(0, el.nodeValue.indexOf(m[1]));
                tail.nodeValue = tail.nodeValue.substring(tail.nodeValue.indexOf(m[1]) + m[1].length);

                // Rebuild the DOM inserting the new anchor element between the split text nodes
                // (.text() rather than .html() so the matched text is never parsed as markup)
                $(el).after(tail).after($("<a></a>").attr("href", m[1]).text(m[1]));

                // Recurse on the new tail node to check for more URLs
                testAndTag(tail);
            }

            // Behave like a function
            return false;
        }

        // For each element selected by jQuery
        this.each(function() {
            // Select all descendant nodes of the element and pick out only text nodes
            var textNodes = $(this).add("*", this).contents().filter(function() {
                return this.nodeType == 3
            });


            // Take action on each text node
            $.each(textNodes, function(i, el) {
                testAndTag(el);
            });
        });
    }
}(jQuery));

$("body").anchorTextUrls(); //Sample call

Please keep in mind that given the way I wrote this to populate the textNodes array, the method will find ALL descendant text nodes, not just immediate children text nodes. If you want it to replace URLs only amongst the text within a specific selector, remove the .add("*", this) call that adds all the descendants of the selected element.

Here's a fiddle example.
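
To see what the boundary regex in the plugin captures without involving the DOM at all, here is a minimal sketch that applies the same expression to a plain string and repeats on the remainder, much as testAndTag recurses on the tail node. The names URL_RE and splitOnUrls are made up for this illustration and are not part of the plugin.

```javascript
// Same pattern as in the plugin above: m[1] is the URL, and the rest of the
// pattern only checks for a boundary (optional punctuation, then whitespace,
// a double quote, or the end of the string).
var URL_RE = /(https?:\/\/.*?)[.!?;,]?(\s+|"|$)/;

// Hypothetical helper: split a string into plain-text and URL pieces.
function splitOnUrls(text) {
  var parts = [];
  var m;
  while ((m = URL_RE.exec(text))) {
    // Text before the URL, if any
    if (m.index > 0) {
      parts.push({ type: "text", value: text.substring(0, m.index) });
    }
    parts.push({ type: "url", value: m[1] });
    // Keep only what follows the URL and look again
    text = text.substring(m.index + m[1].length);
  }
  if (text.length > 0) {
    parts.push({ type: "text", value: text });
  }
  return parts;
}

var parts = splitOnUrls("See http://example.com/page, then https://foo.org.");
// parts[1].value is "http://example.com/page" (the comma is left out)
// parts[3].value is "https://foo.org" (the final period is left out)
```

Trailing punctuation stays in the surrounding text pieces, which is why the plugin does not need whitespace-delimited URLs.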

lsuarez
  • @Tim For what it's worth the regular expression tries to take into account common punctuation or whitespace as best as I could figure might appear around the end of a URL so it doesn't have to be specifically whitespace delimited. – lsuarez Mar 01 '11 at 22:12
  • Considering you've used jQuery (+less code) and wrote the REGEX FROM HELL - I'll mark this as the correct answer. Thanks for your time and JS & jQ excellence! :D – Tim Mar 02 '11 at 14:26
  • @Tim Don't look too closely, or you shall be consumed! I actually didn't write all of it, just wrote the boundary checks behind it. The base was borrowed from work by [keevkilla](http://snipplr.com/users/keevkilla/) on [Snipplr](http://snipplr.com/view/36992/improvement-of-url-interpretation-with-regex/). Should have credited where due sooner. – lsuarez Mar 02 '11 at 15:22
  • @Tim I just made a small edit to the textNodes selector that will kind of be important. Instead of .find() I needed to use .add() to make sure text node children of the top level were included in the tagging. If you want to only get text nodes from the selected element and not its children, just remove the .add() call. Also, it turns out my boundary regex was superior to the long contrived mess because it finds the limits of the URL pretty well and just needs the starting point. – lsuarez Mar 02 '11 at 18:16
  • Ah yes that's much better - the previous regex rule was making anything string joined with a . into a link :) Works like a charm now. Many thanks! – Tim Mar 02 '11 at 19:05
  • This fn works perfect with 1.7.1. Tested and no bugs. Thank you very much. Saved me a lot of time in writing code and regex. Thank you very much – Damien Keitel Feb 20 '12 at 15:16
3

This is one of those few things that jQuery doesn't directly help you with much. You basically have to walk through the DOM tree and examine the text nodes (nodeType === 3); if you find a text node containing the target text you want to wrap ("http://.....", whatever rules you want to apply), you then split the text node (using splitText) into three parts (the part before the string, the part that is the string, and the part following the string), then put the a element around the second of those.

That sounds a bit complicated, but it isn't really all that bad. It's just a recursive descent walker function (for working through the DOM), a regex match to find the things you want to replace, and then a couple of calls to splitText, createElement, insertBefore, appendChild.

Here's an example that searches for a fixed string; just add your regex matching for "http://":

walk(document.body, "foo");

function walk(node, targetString) {
  var child;

  switch (node.nodeType) {
    case 1: // Element
      for (child = node.firstChild;
           child;
           child = child.nextSibling) {
        walk(child, targetString);
      }
      break;

    case 3: // Text node
      handleText(node, targetString);
      break;
  }
}

function handleText(node, targetString) {
  var start, targetNode, followingNode, wrapper;

  // Does the text contain our target string?
  // (This would be a regex test in your http://... case)
  start = node.nodeValue.indexOf(targetString);
  if (start >= 0) {
    // Split at the beginning of the match
    targetNode = node.splitText(start);

    // Split at the end of the match
    followingNode = targetNode.splitText(targetString.length);

    // Wrap the target in an element; in this case, we'll
    // use a `span` with a class, but you'd use an `a`.
    // First we create the wrapper and insert it in front
    // of the target text.
    wrapper = document.createElement('span');
    wrapper.className = "wrapper";
    targetNode.parentNode.insertBefore(wrapper, targetNode);

    // Now we move the target text inside it
    wrapper.appendChild(targetNode);

    // Clean up any empty nodes (in case the target text
    // was at the beginning or end of a text node)
    if (node.nodeValue.length == 0) {
      node.parentNode.removeChild(node);
    }
    if (followingNode.nodeValue.length == 0) {
      followingNode.parentNode.removeChild(followingNode);
    }
  }
}

Live example


Update: The above didn't handle it if there were multiple matches in the same text node (doh!). And oh what the heck, I did a regexp match — you will have to adjust the regexp, and probably do some post-processing on each match, because what's here is too simplistic. But it's a start:

// The regexp should have a capture group that
// will be the href. In our case below, we just
// make it the whole thing, but that's up to you.
// THIS REGEXP IS ALMOST CERTAINLY TOO SIMPLISTIC
// AND WILL NEED ADJUSTING (for instance: what if
// the link appears at the end of a sentence and
// it shouldn't include the ending punctuation?).
walk(document.body, /(http:\/\/[^ ]+)/i);

function walk(node, targetRe) {
  var child, next;

  switch (node.nodeType) {
    case 1: // Element
      for (child = node.firstChild; child; child = next) {
        // Remember the next sibling before recursing: handleText
        // may split or remove `child` and insert new nodes after
        // it, and we don't want to re-process the `a` wrappers
        // we have just created.
        next = child.nextSibling;
        walk(child, targetRe);
      }
      break;

    case 3: // Text node
      handleText(node, targetRe);
      break;
  }
}

function handleText(node, targetRe) {
  var match, targetNode, followingNode, wrapper;

  // Does the text contain a match? Loop so that several
  // matches in the same text node are all handled.
  // (Assumes targetRe has no `g` flag, so each exec
  // starts at the beginning of the string.)
  match = targetRe.exec(node.nodeValue);
  while (match) {
    // Split at the beginning of the match
    targetNode = node.splitText(match.index);

    // Split at the end of the match.
    // match[0] is the full text that was matched.
    followingNode = targetNode.splitText(match[0].length);

    // Wrap the target in an `a` element.
    // First we create the wrapper and insert it in front
    // of the target text. We use the first capture group
    // as the `href`.
    wrapper = document.createElement('a');
    wrapper.href = match[1];
    targetNode.parentNode.insertBefore(wrapper, targetNode);

    // Now we move the target text inside it
    wrapper.appendChild(targetNode);

    // Clean up any empty nodes (in case the target text
    // was at the beginning or end of a text node)
    if (node.nodeValue.length == 0) {
      node.parentNode.removeChild(node);
    }
    if (followingNode.nodeValue.length == 0) {
      followingNode.parentNode.removeChild(followingNode);
      break; // nothing left to search
    }

    // Continue with the next match in the remaining text, if any
    node = followingNode;
    match = targetRe.exec(node.nodeValue);
  }
}

Live example
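
As a rough sketch of the post-processing mentioned in the update, one option is to peel common sentence punctuation off the end of each match before using it as the href. The helper name splitTrailingPunctuation and the particular set of characters are assumptions made for this illustration; a URL that legitimately ends in one of them (a closing parenthesis, say) would be cut short.

```javascript
// Hypothetical helper: separate a matched string into the part to use as
// the link and any trailing punctuation that should stay outside it.
function splitTrailingPunctuation(matched) {
  // One or more punctuation characters at the very end of the match
  var m = /[.,;:!?)\]'"]+$/.exec(matched);
  var cut = m ? m.index : matched.length;
  return {
    href: matched.slice(0, cut),     // goes inside the `a` element
    trailing: matched.slice(cut)     // stays in the following text
  };
}

var result = splitTrailingPunctuation("http://example.com/a.");
// result.href is "http://example.com/a", result.trailing is "."
```

In handleText, the second splitText call could then use the length of href instead of match[0].length, so the punctuation ends up in the following text node rather than inside the link.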

T.J. Crowder
  • Tim: I gave you time to expand upon it, and you did :) awesome. Thanks TJ!! If I could vote up more I would. – Tim Mar 01 '11 at 18:44
  • So, uh, can we get more votes up on this one? This is a metric ton better than my little, "this has already been solved." – buzzedword Mar 01 '11 at 18:44
  • @TJ: Sorry dude... Changing this only wraps the span around the "http://" text and not the rest of the link: walk(document.body, "http://"); please advise..! Many thanks – Tim Mar 01 '11 at 19:06
  • @Tim: That's what I meant about how you'd have to extend it to support doing a regex match or similar. I was just addressing how you find text and wrap it in elements; doing the match for the http pattern you want (probably with a regexp) is left as an exercise for the reader, as indicated in the comments. :-) – T.J. Crowder Mar 01 '11 at 19:13
  • @TJ: lol ok, fair enough :) I think I can work it out from here anyways - thanks for your help. – Tim Mar 01 '11 at 19:16
  • @Tim: I did a regexp example. It's not remotely perfect, but it's a start. Oh, and I fixed a bug in my original. – T.J. Crowder Mar 01 '11 at 19:44
  • @TJ: You are too kind, sir! ^_^ – Tim Mar 01 '11 at 19:58
-2

I haven't tested this in practice, but you can try it:

$('a[href^="http://"]').each(function() {
    // perform your task on each matching anchor
});
Manish Trivedi