I have this regular expression that searches for a phone number pattern:

[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}

It matches phone numbers in these formats:

123 456 7890
(123)456 7890
(123) 456 7890
(123)456-7890
(123) 456-7890
123.456.7890
123-456-7890

I want to scan an entire page (with JavaScript) for this pattern, excluding any match that is already inside an anchor. Once a match is found, I want to convert the phone number into a click-to-call link for mobile devices:

(123) 456-7890 --> <a href="tel:1234567890">(123) 456-7890</a>
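
For a single, already-matched string, the conversion could be sketched roughly as follows. `toTelLink` is a hypothetical helper name used only for illustration; it assumes its argument is a matched phone number and does no HTML escaping.

```javascript
// The phone pattern from the question, unchanged.
const PHONE = /[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}/;

// Hypothetical helper: wrap an already-matched number in a tel: link.
function toTelLink(number) {
  const digits = number.replace(/\D+/g, ''); // keep only the digits for the href
  return '<a href="tel:' + digits + '">' + number + '</a>';
}

// The formats listed in the question.
const formats = [
  '123 456 7890', '(123)456 7890', '(123) 456 7890', '(123)456-7890',
  '(123) 456-7890', '123.456.7890', '123-456-7890',
];
const allMatch = formats.every((s) => PHONE.test(s));

console.log(allMatch);
console.log(toTelLink('(123) 456-7890'));
```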

I'm pretty sure I need a negative lookahead. I've tried this, but it doesn't seem to be the right idea:

(?!.*(\<a href.*?\>))[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}
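
A quick check hints at why this attempt falls short. The lookahead only asserts that no `<a href…>` opening tag appears later on the same line, which says nothing about whether the number itself sits inside an anchor. The two sample strings below are made up for illustration.

```javascript
// The attempted pattern from the question, unchanged.
const attempt = /(?!.*(\<a href.*?\>))[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}/;

// Made-up sample inputs.
const insideAnchor = '<a href="tel:1234567890">123-456-7890</a>';
const beforeAnchor = 'Call 123-456-7890 or <a href="/contact">write</a>';

// Nothing after the visible number looks like "<a href...>", so the number
// inside the anchor is still matched.
const matchesInside = attempt.test(insideAnchor);

// An unrelated anchor later on the line makes the lookahead fail, so a
// plain-text number that should be linked is skipped.
const matchesBefore = attempt.test(beforeAnchor);

console.log(matchesInside, matchesBefore);
```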
Wiktor Stribiżew
Tim
  • do you not want to find phone numbers in href or only if they are tel links already? – abc123 Dec 16 '15 at 21:30
  • I would like to exclude any match that is already inside of an anchor. – Tim Dec 16 '15 at 21:32
  • Right. Unfortunately, since it appears you don't know exactly what the `a` tag will look like, you can't use a lookbehind, because `+` and `*` aren't allowed in one, as shown in this SO answer: http://stackoverflow.com/questions/9030305/regular-expression-lookbehind-doesnt-work-with-quantifiers-or I would likely solve this by finding all phone numbers in anchor tags and making them non-matching for the second regex that looks for phone numbers. – abc123 Dec 16 '15 at 21:39
  • Like this? [`(?:.*?<\/a>)|([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})`](https://regex101.com/r/oL5cS0/1) – Josh Crozier Dec 16 '15 at 21:44
  • What tool/language are you using? – Bohemian Dec 16 '15 at 22:37
  • @Josh Crozier, it looks like the phone number and the anchor both match: http://www.regextester.com/?fam=93675 – Tim Dec 16 '15 at 22:40
  • @Bohemian, I was looking for a JS solution. Adding this to the question. – Tim Dec 16 '15 at 22:58
  • Are you doing this in JS in the browser? If so you can use DOM traversal to find things that are not inside certain elements. – miken32 Dec 16 '15 at 23:31
  • @miken32, Yes, I'm doing this in JS in the browser. Do you have an example to exclude certain elements? – Tim Dec 16 '15 at 23:36
  • @Tim you can just prefix your regex with the following negative lookahead: ^(?! – Unglückspilz Dec 16 '15 at 23:46
  • @Tim to follow up on miken32's suggestion, you can select all non links with document.querySelectorAll("*:not(a)") and evaluate each element's .innerText – Unglückspilz Dec 17 '15 at 00:02

2 Answers

Don't use regular expressions to parse HTML. Use HTML/DOM parsers to get the text nodes (the browser can filter them down for you, for instance by excluding text inside anchor tags and any text too short to contain a phone number), then check the text directly.

For example, with XPath (which is a bit ugly, but has support for dealing with text nodes directly in a way most other DOM methods do not):

// This query finds all text nodes whose whitespace-normalized content is at
// least 12 characters long and that are not direct children of an anchor tag.
// Letting XPath apply basic filters dramatically reduces the number of nodes
// you need to process (there are tons of short and/or pure-whitespace text
// nodes in most DOMs)
var xpr = document.evaluate('descendant-or-self::text()[not(parent::A) and string-length(normalize-space(self::text())) >= 12]',
                            document.body, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i=0, len=xpr.snapshotLength; i < len; ++i) {
    var txt = xpr.snapshotItem(i);
    // Splits with grouping to preserve the text split on
    var numbers = txt.data.split(/([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/);
    // split will return at least three items on a hit, prefix, split match, and suffix
    if (numbers.length >= 3) {
        var parent = txt.parentNode; // Save parent before replacing child
        // Insert new elements before existing element; first element is just
        // text before first phone number
        parent.insertBefore(document.createTextNode(numbers[0]), txt);

        // Now explicitly create pairs of anchors and following text nodes
        for (var j = 1; j < numbers.length; j += 2) {
            // Operate in pairs; odd index is phone number, even is
            // text following that phone number
            var anc = document.createElement('a');
            anc.href = 'tel:' + numbers[j].replace(/\D+/g, '');
            anc.textContent = numbers[j];
            parent.insertBefore(anc, txt);
            parent.insertBefore(document.createTextNode(numbers[j+1]), txt);
        }
        // Remove original text node now that we've inserted all the
        // replacement elements and don't need it for positioning anymore
        parent.removeChild(txt);

        parent.normalize(); // Merge adjacent text nodes and drop empty ones after rebuilding
    }
}
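
The split step in the loop above can be tried outside the browser. The sketch below assumes nothing beyond the regex already shown; `segmentText` is a hypothetical helper name, and the sample sentence is made up.

```javascript
// Same pattern as in the loop above; the capturing group makes split()
// keep the matched numbers in the result array.
const PHONE_SPLIT = /([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/;

// Hypothetical helper: turn one string into alternating text/phone segments.
// Even indexes hold plain text, odd indexes hold the matched phone numbers.
function segmentText(text) {
  const parts = text.split(PHONE_SPLIT);
  return parts.map((value, i) =>
    i % 2 === 1
      ? { kind: 'phone', value, href: 'tel:' + value.replace(/\D+/g, '') }
      : { kind: 'text', value }
  );
}

const segments = segmentText('Call (123) 456-7890 or 123.456.7890 today');
console.log(segments);
```

A string without any phone number yields a single text segment, which is why the loop above only rebuilds the node when `split` returns at least three items.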

For the record, the basic filters help a lot on most pages. For example, on this page, right now, as I see it (this will vary by user, browser, browser extensions and scripts, etc.), the unfiltered query 'descendant-or-self::text()' would return a snapshot of 1794 items. Omitting text parented by anchor tags with 'descendant-or-self::text()[not(parent::A)]' gets it down to 1538, and the full query, which also requires the whitespace-normalized content to be at least twelve characters long, gets it down to 87 items. Applying the regex to 87 items is chump change, performance-wise, and you've removed the need to parse HTML with an unsuitable tool.
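
To get a feel for what the length filter lets through, here is a rough plain-JavaScript approximation of `string-length(normalize-space(...)) >= 12`. The function name `longEnoughForPhone` is hypothetical, and the sketch only approximates the XPath behavior: it trims the ends and collapses runs of spaces, tabs and line breaks into single spaces, so the spaces between words still count toward the length.

```javascript
// Rough approximation of the XPath length filter used in the query.
function longEnoughForPhone(text) {
  const normalized = text
    .replace(/[ \t\r\n]+/g, ' ') // collapse whitespace runs to one space
    .replace(/^ | $/g, '');      // strip the leading and trailing space
  return normalized.length >= 12;
}

console.log(longEnoughForPhone('   \n\t  '));          // whitespace-only node
console.log(longEnoughForPhone('  123 456 7890  '));   // shortest listed format
console.log(longEnoughForPhone('Hello'));              // too short for a number
```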

ShadowRanger
  • Thanks, @ShadowRanger. This looks great. Do you have an example of how I can reinsert the number back into the DOM once it is found? – Tim Dec 18 '15 at 14:25
  • @Tim: I expanded the example to perform DOM reinsertion. – ShadowRanger Dec 22 '15 at 00:37
  • @ShadowRanger I am using your code but am facing a problem where text is getting wiped out if the phone number is next to an anchor. It seems like there is an issue with the parent.textContent = numbers[0]. I have a CodePen link if you can take a look to see what the issue might be: https://codepen.io/d0190535/pen/BEOgZL Thanks! – overloading Apr 23 '19 at 18:27
  • @overloading: Good point. Reassigning `parent.textContent` doesn't work when the text node in question isn't the only child node of `parent` (not just anchors, but any sub-element in the middle of text), and similarly, `appendChild` doesn't work in that case since it would put the new elements after their siblings, not where the old element was. I've fixed it up to use `insertBefore` based on the original text node itself in every case, followed by `parent.removeChild(txt);` to clear it out once the new children exist. Check the edit history to see the required changes (I kept them minimal). – ShadowRanger Apr 23 '19 at 18:35
  • @ShadowRanger Works well! Thanks! – overloading Apr 23 '19 at 19:53

Use this as your regex:

(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))

Use this as your replace string:

<a href="tel:$3$7$4$8$5$9">($3$7) $4$8-$5$9</a>

This finds all phone numbers, both outside and inside anchor tags; in every case, the parts of the phone number are captured in specific regex groups. You can therefore wrap each phone number found in a new anchor tag because, where an anchor already exists, the whole original anchor is replaced.
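
As a rough illustration, the pattern and replacement string above can be passed to `String.prototype.replace`. The `g` flag and the sample input are assumptions added for this sketch; they are not part of the answer's regex.

```javascript
// Pattern and replacement string from this answer, with the g flag added
// (an assumption) so that every occurrence is replaced.
const pattern = /(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))/g;
const replacement = '<a href="tel:$3$7$4$8$5$9">($3$7) $4$8-$5$9</a>';

// Made-up input: one bare number and one that is already linked.
const html = 'Call 123.456.7890 or <a href="tel:5551234567">555-123-4567</a>.';

// Only one alternative matches at a time, so the groups of the other
// alternative are inserted as empty strings.
const result = html.replace(pattern, replacement);
console.log(result);
```

Note that an existing anchor is rebuilt rather than left untouched, so any other attributes or text inside it are dropped, and because `.` does not match line breaks, an anchor that spans several lines would not be recognized by the first alternative.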

A regex group, or "capture group", captures a specific part of what matched the overall regular expression. Groups are created by enclosing part of the regex in parentheses. They are numbered from left to right by the order of their opening parentheses, and the part of the input they match can be referenced in a JavaScript replacement string by placing a $ in front of that number. Other implementations use \ for this purpose. This is called a back reference. Back references can appear later in your regular expression (where JavaScript writes them with a \, as in \1), or in your replacement string (as done earlier in this answer). More information: http://www.regular-expressions.info/backref.html
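
A small sketch of the numbering and of both reference styles in JavaScript; the sample strings are made up for illustration.

```javascript
// Groups are numbered by the position of their opening parenthesis:
// group 1 is the whole date, groups 2 to 4 are its parts.
const m = /((\d{4})-(\d{2})-(\d{2}))/.exec('released 2015-12-16');
console.log(m[1], m[2], m[3], m[4]);

// Inside the pattern itself, JavaScript writes a back reference as \1.
// Here it detects a word that is immediately repeated.
const doubled = /\b(\w+) \1\b/.test('it is is here');

// In the replacement string, groups are written as $1, $2, and so on.
const swapped = 'Doe, Jane'.replace(/(\w+), (\w+)/, '$2 $1');
console.log(doubled, swapped);
```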

To use a simpler example, suppose you had a document containing account numbers and other information. Each account number is preceded by the word "account", which you want to change to "acct", but "account" also appears elsewhere in the document, so you cannot simply do a find and replace on that word alone. You could use a regex of account ([0-9]+). In this regex, ([0-9]+) forms a group that matches the actual account number, which we can reference as $1 in our replacement string, which becomes acct $1.
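
In JavaScript, that example might look like the following sketch. The sample sentence is made up, and the `g` flag is added so that every account number is rewritten.

```javascript
// Made-up sample text: "account" appears both before numbers and on its own.
const text = 'See account 12345. Your account manager will call about account 678.';

// Only "account" followed by a number is rewritten; $1 re-inserts the number.
const out = text.replace(/account ([0-9]+)/g, 'acct $1');
console.log(out);
```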

You can test this out here: http://regexr.com/

tekim