
This is in reference to a question I posted last week.

I want to look for all valid phone numbers on a page. If they are not already in a link, I want to create a 'click to call' link to be used on mobile browsers. I received a couple of good answers on the original question I posted, but wanted to try a different approach/expand on the feedback.

I'm using jQuery and Regex to filter the contents of a page and make the links clickable.

Here is what I've come up with:

    var phoneRegex = new RegExp(/([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})(?![^<]*>|[^<>]*<\/)/g);
    var phoneNums = $( "body *" ).filter(function() {
        var tagname = $(this).prop("tagName");
        tagname = tagname === null ? "" : tagname.toLowerCase();
        if (tagname == "a") {
            return false;
        }
        var match = $(this).html().match(phoneRegex);
        if (match === null || match.length === 0) {
            return false;
        }
        return true;
    });
    phoneNums.html(function() {
        var newhtml = $(this).html().replace(phoneRegex, function(match) {
            var phonenumber = match.replace(/ /g, "").replace(/-/g, "").replace(/\(/g, "").replace(/\)/g, "");
            var link = '<a href="tel:' + phonenumber + '">' + match + '</a>';
            return link;
        });
        return newhtml;
    });

So, basically I search everything in the body, tag by tag (excluding anchor tags), test each element's HTML against the regex, and store the matching elements in the 'phoneNums' variable. From there I remove all spaces, dashes, and parentheses so the number is formatted correctly for the tel: link. So a number such as (123) 456-7890 becomes: <a href="tel:1234567890">(123) 456-7890</a>
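
A minimal sketch of that normalization step in isolation, assuming a hypothetical helper named `toTelHref` (not part of the code above). It strips every non-digit in one pass. Note that the pattern also accepts dots as separators, which the chained `replace` calls above would leave in the `href`.

```javascript
// Hypothetical helper (the name is an assumption): builds the href value
// for a click-to-call link from the text matched by the phone regex.
function toTelHref(displayText) {
    // Remove everything that is not a digit: spaces, dashes, dots, parentheses
    var digits = displayText.replace(/\D+/g, '');
    return 'tel:' + digits;
}

console.log(toTelHref('(123) 456-7890')); // tel:1234567890
console.log(toTelHref('123.456.7890'));   // tel:1234567890
```

The visible link text can stay as the original match, so only the `href` is normalized.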

The problem I see with doing this is that if these numbers are inside nested tags on the page, I get the results multiple times. (This can be seen if you do a console.log on link just before it is returned.) The results are correct, but I'm wondering whether this approach makes sense.

Is there a more efficient way of doing this? Thanks in advance!

Tim
  • This is why I used `XPath` in my answer; it lets you find the text nodes directly, and process them without recourse to parsing/modifying HTML using regular expressions. You want to push the work down as low as possible, instead of running find and replace on `.html` (which means you're replacing tons of stuff that could include stuff like `script` tags, element attributes, etc.). – ShadowRanger Dec 22 '15 at 00:15
  • @ShadowRanger - Do you have an example of how I can reinsert the formatted number back into the DOM once it is found in that format? I'm new to the whole XPath solution. – Tim Dec 22 '15 at 00:18
  • XPath isn't actually involved once you find the nodes. I've expanded my original answer to include code showing how the replacement is performed; should probably take a look at that and delete this question. – ShadowRanger Dec 22 '15 at 00:38

1 Answer


As before (this is copied from my answer to the original question, which I updated to include the code for performing element replacement): don't use regular expressions to parse HTML. Use HTML/DOM parsers to get the text nodes (the browser can filter them down for you, for instance to exclude anchor tags and any text too short to contain a phone number), and then you can check the text directly.

For example, with XPath (which is a bit ugly, but has support for dealing with text nodes directly in a way most other DOM methods do not):

    // This query finds all text nodes whose content, after whitespace
    // normalization, is at least 12 characters long and that are not direct
    // children of an anchor tag.
    // Letting XPath apply basic filters dramatically reduces the number of nodes
    // you need to process (there are tons of short and/or pure whitespace text
    // nodes in most DOMs)
    var xpr = document.evaluate('descendant-or-self::text()[not(parent::A) and string-length(normalize-space(self::text())) >= 12]',
                                document.body, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
    for (var i = 0, len = xpr.snapshotLength; i < len; ++i) {
        var txt = xpr.snapshotItem(i);
        // Splits with grouping to preserve the text split on
        var numbers = txt.data.split(/([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/);
        // split returns at least three items on a hit: prefix, split match, and suffix
        if (numbers.length >= 3) {
            var parent = txt.parentNode; // Save parent before replacing child
            // Build the replacement in a fragment, starting with the text before
            // the first number, so that siblings of txt are left untouched
            var frag = document.createDocumentFragment();
            frag.appendChild(document.createTextNode(numbers[0]));

            // Now explicitly create pairs of anchors and following text nodes.
            // This loop needs its own counter (j): var is function-scoped, so
            // reusing i here would corrupt the outer loop.
            for (var j = 1; j < numbers.length; j += 2) {
                // Operate in pairs; odd index is phone number, even is
                // text following that phone number
                var anc = document.createElement('a');
                anc.href = 'tel:' + numbers[j].replace(/\D+/g, '');
                anc.textContent = numbers[j];
                frag.appendChild(anc);
                frag.appendChild(document.createTextNode(numbers[j + 1]));
            }
            parent.replaceChild(frag, txt); // Swap the text node for the new nodes
            parent.normalize(); // Merge adjacent text nodes and drop empty ones
        }
    }
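
To see why the loop walks the array in pairs, here is a DOM-free sketch of how `split` behaves when the pattern contains a capturing group. The sample strings are made up for illustration; the pattern is the one used above.

```javascript
// Same pattern as in the loop above; the capturing group makes split()
// keep each matched phone number in the result array.
var phoneRegex = /([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/;

// One hit: prefix, match, suffix
var oneHit = 'Call (123) 456-7890 now'.split(phoneRegex);
console.log(oneHit); // [ 'Call ', '(123) 456-7890', ' now' ]

// Two hits: odd indexes hold numbers, even indexes hold surrounding text
var twoHits = 'A 123-456-7890 or 555.123.4567.'.split(phoneRegex);
console.log(twoHits.length); // 5

// No hit: the whole string comes back as a single item
var noHit = 'No number here'.split(phoneRegex);
console.log(noHit.length); // 1
```

Because the array always has an odd length, every number at an odd index has a following text item at the next even index, even if that item is an empty string.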

For the record, the basic filters help a lot on most pages. For example, on this page, right now, as I see it (will vary by user, browser, browser extensions and scripts, etc.) without the filters, the snapshot for the query 'descendant-or-self::text()' would have 1794 items. Omitting text parented by anchor tags, 'descendant-or-self::text()[not(parent::A)]' gets it down to 1538, and the full query, verifying that the non-whitespace content is at least twelve characters long gets it down to 87 items. Applying the regex to 87 items is chump change, performance-wise, and you've removed the need to parse HTML with an unsuitable tool.
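
The threshold of 12 lines up with the shortest string the pattern can match (3 digits, a separator, 3 digits, a separator, 4 digits). As a rough illustration of what the length filter keeps and discards, here is a plain JavaScript stand-in for the XPath check; `normalizeSpace` and `passesLengthFilter` are hypothetical names, and this only approximates what the browser's XPath engine does.

```javascript
// Approximation of XPath's normalize-space(): collapse runs of space, tab,
// CR, and LF into single spaces, then strip leading and trailing space.
function normalizeSpace(text) {
    return text.replace(/[ \t\r\n]+/g, ' ').replace(/^ | $/g, '');
}

// Hypothetical predicate mirroring the length check in the query
function passesLengthFilter(text) {
    return normalizeSpace(text).length >= 12;
}

console.log(passesLengthFilter('\n      '));         // false
console.log(passesLengthFilter('Tel:'));             // false
console.log(passesLengthFilter('  123-456-7890\n')); // true
```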

ShadowRanger