
In February, John Gruber updated his URL regex pattern once again. I'm trying to get it to work in JavaScript, but haven't had any luck so far. I have read through the popular answers on SO looking for a solution, for example this one, or the one that generally discourages "manual" solutions like this.

I removed the mode modifier, but it still does nothing:

var exp = /\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))/ig
;

var autolinked_text = text.replace(exp, "<a href=\"$1\" rel=\"nofollow\">$1</a>");

I tried escaping the forward slashes, but that didn't help, so I reverted that change in the code pasted here to avoid introducing errors of my own.
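For reference, here is a small sketch of the two ways a forward slash can be written in a JavaScript pattern. The short pattern and the sample string are illustrative assumptions, not Gruber's full expression, and an inline modifier such as `(?i)` is assumed to map to the `i` flag.

```javascript
// In a regex literal, "/" ends the pattern, so each slash must be escaped.
var literal = /https?:\/{1,3}/i;

// With the RegExp constructor, "/" needs no escaping, but every backslash
// inside the string has to be doubled. The "i" flag stands in for an
// inline case-insensitivity modifier.
var constructed = new RegExp("https?:/{1,3}[^\\s]+", "i");

// Hypothetical sample input.
var sample = "see HTTP://example.com/page for details";
var literalMatch = literal.exec(sample);
var constructedMatch = constructed.exec(sample);
```

Either form compiles to the same kind of RegExp object; the constructor form just moves the escaping burden from slashes to backslashes.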

Update:

I also posted this question because I suspect many people are trying to solve the same problem. Gruber's (old, 2010) regex is quite popular and I thought it might be a good idea to have an answer documented on SO for the 2014 update.

Update 2:

I was asked to post the version I tried with the slashes escaped. Here it is:

var exp = /\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b\/?(?!@)))
/ig;

This gives me `Uncaught SyntaxError: Invalid regular expression: /#<error>/: Invalid group` in Chrome Dev Tools.

oelna
  • Your regex looks very different from what's on the linked blog. What am I missing? – Denys Séguret Jun 10 '14 at 12:09
  • Is there a reason to do such strict checking? Such a complicated regex could break anytime. Especially since more TLD options are becoming available. – MarioDS Jun 10 '14 at 12:09
  • @MDeSchaepmeester: It seems clear to me based on his code that he simply wants to convert any text URLs to actual links. Anything less strict would probably cause a lot of false positives. I do agree, however, that with the new gTLDs, doing this is much harder. – ohaal Jun 10 '14 at 12:12
  • @dystroy I used the 2014 updated version. For the old version there are plenty of answers available already. Could you take a look? It's linked in the first paragraph of the article or on [github](https://gist.github.com/gruber/8891611). – oelna Jun 10 '14 at 12:12
  • @MDeSchaepmeester I know there are probably better (and more complicated) ways of linking urls via Javascript. But as I only need it for a non-crucial part of a website, I'd like to use a small-ish version that works *most of the time*. Gruber's solution is often used and seemed a better starting point than to try and create my own regex. – oelna Jun 10 '14 at 12:15
  • In your code above there is an unescaped "/". In a deleted answer this is mentioned. If that really is the problem then someone should just write "You have an unescaped /" as the answer. If it's simply a typo on SO then please fix it so that people know that the problem is somewhere else. – slebetman Jun 10 '14 at 12:21
  • @slebetman As I said, I tried escaping the / myself, but it didn't fix it for me. If you post an answer with the new (escaped) regex and it works, I will of course accept it. – oelna Jun 10 '14 at 12:22
  • Please post the escaped / that didn't work for you instead of the code above because as is it is very misleading. – slebetman Jun 10 '14 at 12:24
  • @slebetman I updated my question. Chrome Dev Tools now outputs `Invalid regular expression: /#<error>/: Invalid group`. – oelna Jun 10 '14 at 12:32
  • That re has a negative-lookbehind: **`(?<!@)`** which is not supported in JavaScript and is misinterpreted as a malformed capture group. – Alex K. Jun 10 '14 at 12:42
  • @AlexK. Can this be fixed? Or would you say it's not worth the trouble and I should use a totally different expression? – oelna Jun 10 '14 at 12:53
  • You would need to read through the regex and figure out what that token is doing in the context of its surrounding logic and decide how to delete it without breaking anything. Good luck! – Alex K. Jun 10 '14 at 12:56
  • Personally, I would have a simple RE to test for basic validity then another lookup entirely for the TLD. – Alex K. Jun 10 '14 at 12:57
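The two-step idea from the last comment (a simple pattern for basic validity, then a separate TLD lookup) might look roughly like the sketch below. The names `KNOWN_TLDS`, `candidate` and `findLinks` are hypothetical, the TLD list is deliberately tiny, and the candidate pattern is far looser than Gruber's.

```javascript
// Hypothetical, deliberately short TLD table; a real one would be much longer.
var KNOWN_TLDS = { com: true, net: true, org: true, de: true, io: true };

// Step 1: loose candidate pattern for scheme-less host names.
// Group 2 captures the last label (the would-be TLD), group 3 an optional path.
var candidate = /\b([a-z0-9]+(?:[.\-][a-z0-9]+)*\.([a-z]{2,}))\b(\/[^\s]*)?/gi;

function findLinks(text) {
  var results = [];
  var m;
  while ((m = candidate.exec(text)) !== null) {
    // Step 2: check the TLD outside the regex, so the list can grow
    // without touching the pattern.
    var tld = m[2].toLowerCase();
    if (KNOWN_TLDS.hasOwnProperty(tld)) {
      results.push(m[0]);
    }
  }
  return results;
}
```

For example, `findLinks("visit example.com or foo.bar now")` keeps `example.com` and drops `foo.bar`, because `bar` is not in the table. This sketch does not deal with e-mail addresses or trailing punctuation.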

1 Answer


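As of this writing, JavaScript does not support lookbehind (`(?<=...)` or `(?<!...)`), and your pattern contains `(?<!@)`; that is what triggers the "Invalid group" error (source).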

So you may want to try another pattern (sorry, I can't help you there if you need that level of strictness), or run the match with a regex implementation that supports lookbehind, such as PHP's PCRE.
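If you want to stay in JavaScript, one common workaround is to emulate the negative lookbehind: capture the character before the match and reject the match in a replace callback when that character is `@`. The sketch below is only an illustration under several assumptions: `bareDomain` is a simplified stand-in for the scheme-less branch of the pattern with a three-entry TLD list, `autolink` is a hypothetical helper, and prepending `http://` to the href is my choice, not part of the original code. It is not equivalent to Gruber's full expression.

```javascript
// Group 1 captures the character before the host name (or the empty string
// at the start of the input). It takes over the job of the unsupported (?<!@).
// Group 2 is the scheme-less URL itself; the TLD list is shortened for brevity.
var bareDomain = /(^|[^a-z0-9.\-])([a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org)\b\/?(?!@))/gi;

function autolink(text) {
  return text.replace(bareDomain, function (match, before, url) {
    if (before === "@") {
      // Preceded by "@": treat it as part of an e-mail address and leave it alone.
      return match;
    }
    // Put the captured leading character back in front of the generated link.
    return before + '<a href="http://' + url + '" rel="nofollow">' + url + "</a>";
  });
}
```

With this, `autolink("mail me@example.com or see example.org")` leaves the address untouched and links only `example.org`. Newer engines that implement ES2018 accept lookbehind directly, so a workaround like this is only needed where that support is missing.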

Camille Hodoul