223

Does anyone have suggestions for detecting URLs in a set of strings?

arrayOfStrings.forEach(function(string){
  // detect URLs in strings and do something swell,
  // like creating elements with links.
});

Update: I wound up using this regex for link detection… Apparently several years later.

kLINK_DETECTION_REGEX = /(([a-z]+:\/\/)?(([a-z0-9\-]+\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel|local|internal))(:[0-9]{1,5})?(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-zA-Z0-9!$&'()*+.=-_~:@/?]*)?)(\s+|$)/gi

The full helper (with optional Handlebars support) is at gist #1654670.

David Thomas
arbales
  • 22
    It's probably not a good idea to try to list out a finite set of TLDs, since they keep creating new ones. – Maxy-B Apr 11 '13 at 13:47
  • Agree. Sometimes what we need is code whose TLD list can be updated. A build script could append TLDs to the regex, or the code could update its TLD list dynamically. Some things in life are meant to be standardized, like TLDs and time zones. A finite list can be good for verifying URLs against existing TLDs in real-world address use cases. – Edward Chan JW Sep 28 '17 at 07:36
  • This doesn't appear to work without trailing slashes? ```https://www.npmjs.com/package/linkifyjs``` will fail but ```https://www.npmjs.com/package/linkifyjs/``` passes – SRR Dec 24 '21 at 15:07

16 Answers

309

First you need a good regex that matches urls. This is hard to do. See here, here and here:

...almost anything is a valid URL. There are some punctuation rules for splitting it up. Absent any punctuation, you still have a valid URL.

Check the RFC carefully and see if you can construct an "invalid" URL. The rules are very flexible.

For example ::::: is a valid URL. The path is ":::::". A pretty stupid filename, but a valid filename.

Also, ///// is a valid URL. The netloc ("hostname") is "". The path is "///". Again, stupid. Also valid. This URL normalizes to "///" which is the equivalent.

Something like "bad://///worse/////" is perfectly valid. Dumb but valid.

Anyway, this answer is not meant to give you the best regex but rather a proof of how to do the string wrapping inside the text, with JavaScript.

OK, so let's just use this one: /(https?:\/\/[^\s]+)/g

Again, this is a bad regex. It will have many false positives. However it's good enough for this example.

function urlify(text) {
  var urlRegex = /(https?:\/\/[^\s]+)/g;
  return text.replace(urlRegex, function(url) {
    return '<a href="' + url + '">' + url + '</a>';
  })
  // or alternatively
  // return text.replace(urlRegex, '<a href="$1">$1</a>')
}

var text = 'Find me at http://www.example.com and also at http://stackoverflow.com';
var html = urlify(text);

console.log(html)
// html now looks like:
// "Find me at <a href="http://www.example.com">http://www.example.com</a> and also at <a href="http://stackoverflow.com">http://stackoverflow.com</a>"
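
To make the "many false positives" concrete, here is a small sketch that runs the same regex over a few inputs. The sample strings are made up for illustration. Because `[^\s]+` accepts any non-whitespace character, trailing punctuation and closing brackets end up inside the match, and anything after `http://` counts as a URL:

```javascript
// Same pattern as in urlify() above.
var urlRegex = /(https?:\/\/[^\s]+)/g;

// Hypothetical sample inputs, chosen to show over-matching.
var samples = [
  'See http://www.example.com, then reply.',
  'Docs (http://www.example.com/a).',
  'Broken: http://???'
];

var overMatches = samples.map(function (s) {
  // String.prototype.match with a /g regex returns all matches (or null).
  return s.match(urlRegex);
});

console.log(overMatches);
// The first match keeps the trailing comma, the second keeps ")." and
// the third accepts "http://???" even though it has no usable host.
```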

So in summary, you can try:

// jQuery's .each() passes (index, element) to the callback, in that order.
$('#pad dl dd').each(function(index, element) {
    element.innerHTML = urlify(element.innerHTML);
});
juleslasne
Crescent Fresh
  • 12
    Some examples of the "many false positives" would greatly improve this answer. Otherwise future Googlers are just left with some (maybe valid?) FUD. – cmcculloh Jul 23 '14 at 02:41
  • 1
    I never knew you can pass function as second param for ```.replace``` :| – Aamir Afridi Jun 17 '15 at 15:44
  • 4
    It's good, but it does the "wrong" thing with trailing punctuation `text="Find me at http://www.example.com, and also at http://stackoverflow.com."` results in two 404s. Some users are aware of this and will add a space after URLs before punctuation to avoid breakage, but most linkifiers I use (Gmail, etherpad, phabricator) separate trailing punctuation from the URL. – skierpage Jul 30 '15 at 19:01
  • In case the text already contains anchored URLs, you can use function removeAnchors(text) { var div = $('<div></div>').html(text); div.find('a').contents().unwrap(); return div.text(); } to first remove the anchors before return text.replace – Muneeb Mirza Nov 27 '18 at 08:22
  • If text already contains anchored url, you are using jquery to remove anchor, but I am using Angular. How can I remove anchor in Angular ? – Sachin Jagtap May 02 '19 at 07:44
  • You can simplify this regex `/(https?:\/\/\S+)/g` – PlatypusMaximus Feb 06 '22 at 13:34
  • The second code isn't really secure because it allows script injection, instead pass the element into the function and directly replace the urls so it's safer. – AksLolCoding Nov 20 '22 at 16:12
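
One way to address the script-injection concern raised in the comments above is to treat the input as plain text and HTML-escape everything, including the URL, before building the anchor. The following is a minimal sketch under that assumption. The names `escapeHtml` and `urlifyEscaped` are hypothetical and not part of the original answer, and the URL regex is still the deliberately naive one.

```javascript
// Escape the characters that are special in HTML text and attribute values.
function escapeHtml(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical variant of urlify() that expects plain text, not HTML.
function urlifyEscaped(text) {
  var urlRegex = /https?:\/\/[^\s]+/g;
  var out = '';
  var last = 0;
  var m;
  while ((m = urlRegex.exec(text)) !== null) {
    var url = m[0];
    // Escape the plain text before the URL, then the URL itself.
    out += escapeHtml(text.slice(last, m.index));
    out += '<a href="' + escapeHtml(url) + '">' + escapeHtml(url) + '</a>';
    last = m.index + url.length;
  }
  // Escape whatever follows the last URL.
  return out + escapeHtml(text.slice(last));
}

var html = urlifyEscaped('<b>hi</b> http://example.com/?a=1&b=2');
console.log(html);
```

The returned string is meant to be assigned to the innerHTML of an element whose original content was read as plain text (for example via textContent), so existing markup is never re-interpreted.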
199

Here is what I ended up using as my regex:

var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;

This doesn't include trailing punctuation in the URL. Crescent's function works like a charm :) so:

function linkify(text) {
    var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
    return text.replace(urlRegex, function(url) {
        return '<a href="' + url + '">' + url + '</a>';
    });
}
skierpage
Niaz Mohammed
  • 6
    Finally a regex that really works in most obvious case! This one deserves a bookmarking. I tested thousands examples from googles search until i find this. – Ismael Jan 16 '15 at 15:11
  • 8
    Simple and nice! But the `urlRegex` should be defined _outside_ `linkify` as compiling it is expensive. – B M Aug 19 '17 at 19:22
  • 2
    This fails to detect full URL: http://disney.wikia.com/wiki/Pua_(Moana) – Jry9972 Dec 14 '17 at 11:07
  • 1
    I added `()` in each list of characters and it works now. – Guillaume F. Mar 21 '18 at 01:06
  • 9
    it fails to detect a url beginning with just www. for ex: www.facebook.com – CraZyDroiD Oct 11 '18 at 04:44
  • 3
    @CraZyDroiD that's not a valid url, url must start with http or https – Usman Iqbal Mar 13 '20 at 11:05
  • 1
    Anchor tag is being displayed as a plain text on angular's HTML template. Can anybody tell me the reason behind it and how to solve this issue? I'm using ```linkify()``` method in a data binding like this ```{{ linkify(description) }}``` – Mayank Kataria Mar 18 '21 at 13:42
  • There appears to be a weird bug/pitfall with regex becoming stateful when using the `g` global flag: https://youtu.be/Uv4prDgyHF0 – Janosh Apr 16 '21 at 05:40
  • To see the problem enter `urlRegex.test('https://youtu.be/Uv4prDgyHF0')` -> true followed by the same again `urlRegex.test('https://youtu.be/Uv4prDgyHF0')` -> false. – Janosh Apr 16 '21 at 05:42
  • @MayankKataria did you ever figure this out? – fIwJlxSzApHEZIl Feb 15 '22 at 16:33
  • @fIwJlxSzApHEZIl I don't remember exactly. But by looking at my old code, I think I used arrow functions and in return statement I used template literals(`${url}`) instead of strings. ```linkify(text) { return text.replace(this.urlRegex, (url: string) => { console.log('url: ', url); return `${url}`; }); } ``` – Mayank Kataria Feb 16 '22 at 14:03
  • This does not work in Safari – Brad Mathews Feb 03 '23 at 19:01
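
The stateful behaviour described in the comments above comes from the `lastIndex` property: a regex with the `g` flag remembers where its last successful `test()` or `exec()` ended and resumes from there. A short sketch of the effect and two possible workarounds (the variable names other than `urlRegex` are illustrative):

```javascript
var urlRegex = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;
var sample = 'https://youtu.be/Uv4prDgyHF0';

// First call matches and moves lastIndex to the end of the match.
var first = urlRegex.test(sample);
// Second call resumes at lastIndex, finds nothing, and resets lastIndex to 0.
var second = urlRegex.test(sample);

// Workaround 1: reset lastIndex before each yes/no check.
urlRegex.lastIndex = 0;
var afterReset = urlRegex.test(sample);

// Workaround 2: keep a separate regex without the g flag for yes/no checks.
var urlTestRegex = new RegExp(urlRegex.source, 'i');
var stateless = [urlTestRegex.test(sample), urlTestRegex.test(sample)];

console.log(first, second, afterReset, stateless);
```

`String.prototype.replace` and `String.prototype.match` reset `lastIndex` themselves, so `linkify()` as written is not affected; the pitfall only shows up when the same global regex is reused with `test()` or `exec()`.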
64

I googled this problem for quite a while, then it occurred to me that there is an Android method, android.text.util.Linkify, that utilizes some pretty robust regexes to accomplish this. Luckily, Android is open source.

They use a few different patterns for matching different types of urls. You can find them all here: http://grepcode.com/file/repository.grepcode.com/java/ext/com.google.android/android/2.0_r1/android/text/util/Regex.java#Regex.0WEB_URL_PATTERN

If you're only concerned with URLs that match the WEB_URL_PATTERN, that is, URLs that conform to the RFC 1738 spec, you can use this:

/((?:(http|https|Http|Https|rtsp|Rtsp):\/\/(?:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})?\@)?)?((?:(?:[a-zA-Z0-9][a-zA-Z0-9\-]{0,64}\.)+(?:(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(?:biz|b[abdefghijmnorstvwyz])|(?:cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(?:edu|e[cegrstu])|f[ijkmor]|(?:gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(?:info|int|i[delmnoqrst])|(?:jobs|j[emop])|k[eghimnrwyz]|l[abcikrstuvy]|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])|(?:name|net|n[acefgilopruz])|(?:org|om)|(?:pro|p[aefghklmnrstwy])|qa|r[eouw]|s[abcdeghijklmnortuvyz]|(?:tel|travel|t[cdfghjklmnoprtvwz])|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))|(?:(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])))(?:\:\d{1,5})?)(\/(?:(?:[a-zA-Z0-9\;\/\?\:\@\&\=\#\~\-\.\+\!\*\'\(\)\,\_])|(?:\%[a-fA-F0-9]{2}))*)?(?:\b|$)/gi;

Here is the full text of the source:

"((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)"
+ "\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_"
+ "\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?"
+ "((?:(?:[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}\\.)+"   // named host
+ "(?:"   // plus top level domain
+ "(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])"
+ "|(?:biz|b[abdefghijmnorstvwyz])"
+ "|(?:cat|com|coop|c[acdfghiklmnoruvxyz])"
+ "|d[ejkmoz]"
+ "|(?:edu|e[cegrstu])"
+ "|f[ijkmor]"
+ "|(?:gov|g[abdefghilmnpqrstuwy])"
+ "|h[kmnrtu]"
+ "|(?:info|int|i[delmnoqrst])"
+ "|(?:jobs|j[emop])"
+ "|k[eghimnrwyz]"
+ "|l[abcikrstuvy]"
+ "|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])"
+ "|(?:name|net|n[acefgilopruz])"
+ "|(?:org|om)"
+ "|(?:pro|p[aefghklmnrstwy])"
+ "|qa"
+ "|r[eouw]"
+ "|s[abcdeghijklmnortuvyz]"
+ "|(?:tel|travel|t[cdfghjklmnoprtvwz])"
+ "|u[agkmsyz]"
+ "|v[aceginu]"
+ "|w[fs]"
+ "|y[etu]"
+ "|z[amw]))"
+ "|(?:(?:25[0-5]|2[0-4]" // or ip address
+ "[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]"
+ "|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]"
+ "[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}"
+ "|[1-9][0-9]|[0-9])))"
+ "(?:\\:\\d{1,5})?)" // plus option port number
+ "(\\/(?:(?:[a-zA-Z0-9\\;\\/\\?\\:\\@\\&\\=\\#\\~"  // plus option query params
+ "\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?"
+ "(?:\\b|$)";

If you want to be really fancy, you can test for email addresses as well. The regex for email addresses is:

/[a-zA-Z0-9\+\.\_\%\-]{1,256}\@[a-zA-Z0-9][a-zA-Z0-9\-]{0,64}(\.[a-zA-Z0-9][a-zA-Z0-9\-]{0,25})+/gi

PS: The top level domains supported by above regex are current as of June 2007. For an up to date list you'll need to check https://data.iana.org/TLD/tlds-alpha-by-domain.txt.
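
If you would rather generate the TLD part than maintain it by hand, the list can be turned into a regex alternation at build time. The sketch below assumes the file format used by the IANA list (a `#` comment line followed by one TLD per line) and uses a small inline sample instead of downloading anything. The helper name `buildTldAlternation` and the simplified host pattern are hypothetical, not taken from the Android source.

```javascript
// Inline sample in the assumed file format; stands in for the downloaded list.
var tldFileText = [
  '# Version 0000000000, sample data only',
  'COM',
  'ORG',
  'MUSEUM',
  'XN--P1AI'
].join('\n');

// Hypothetical helper: turn the file contents into a regex source fragment.
function buildTldAlternation(fileText) {
  return fileText
    .split(/\r?\n/)
    .map(function (line) { return line.trim().toLowerCase(); })
    .filter(function (line) { return line !== '' && line.charAt(0) !== '#'; })
    // Longest first, so a short TLD is not tried before a longer one.
    .sort(function (a, b) { return b.length - a.length || a.localeCompare(b); })
    // Escape regex metacharacters, in case an entry ever contains one.
    .map(function (tld) { return tld.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); })
    .join('|');
}

var tlds = buildTldAlternation(tldFileText);
// Deliberately simplified host pattern: one or more labels, then a known TLD.
var hostRegex = new RegExp('\\b(?:[a-z0-9-]+\\.)+(?:' + tlds + ')\\b', 'gi');

console.log(tlds);
console.log('visit example.org or foo.invalid'.match(hostRegex));
```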

Adam
  • 3
    Since you have a case-insensitive regular expression, you don’t have to specify `a-zA-Z` and `http|https|Http|Https|rtsp|Rtsp`. – Ry- Dec 05 '13 at 03:06
  • 6
    This is nice, but I'm not sure I'd ever use it. For most use cases I'd rather accept some false positives than use an approach that relies on a hard-coded list of TLDs. If you list TLDs in your code, you're guaranteeing that it will be obsolete one day, and I'd rather not build mandatory future maintenance into my code if I can avoid it. – Mark Amery Mar 29 '15 at 11:10
  • 3
    This works 101% of the time, unfortunately it also finds urls that aren't preceded by a space. If i run a match on hello@mydomain.com it catches 'mydomain.com'. Is there a way to improve upon this to only catch it if it has a space before it? – Deminetix Mar 31 '15 at 05:03
  • Also to note, this is perfect for catching user entered urls – Deminetix Mar 31 '15 at 05:04
  • 1
    Note that grepcode.com is no longer up, [here](https://cs.android.com/android/platform/superproject/+/master:frameworks/base/core/java/android/util/Patterns.java;bpv=1;bpt=1;l=323) is what I _think_ is a link to the right place in the Android source code. I think the regex Android is using might be updated since 2013 (original post), but does not appear to have been updated since 2015 and may therefore be missing some newer TLDs. – James Dec 18 '19 at 19:01
  • very smart to look into the android code, thank you for posting this – Symphony0084 Jun 27 '20 at 02:00
  • Not sure where to begin, but this string: "Here is a link you can click it here: (www.google.com/asdf)." erroneously returns this as a match "www.google.com/asdf)." Perhaps that it is a valid URL, but it is obvious to a human where the URL actually ends. – Symphony0084 Jun 27 '20 at 02:05
  • iMessage's regex seems to handle the above issue correctly by only matching "www.google.com/asdf". Doubt theirs is open source though. – Symphony0084 Jun 27 '20 at 02:09
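
For the parenthesis case mentioned in the comments above, one possible approach is to post-process each raw match instead of making the regex more complicated: strip sentence punctuation from the end, and strip a closing parenthesis only when the URL contains no matching opening one. This is a sketch of that heuristic, not what Android or iMessage actually do; `trimUrlEnd` and `count` are hypothetical names.

```javascript
// Count occurrences of a single character in a string.
function count(s, ch) {
  return s.split(ch).length - 1;
}

// Split a raw match into the URL proper and the trailing text that was
// most likely sentence punctuation or an unbalanced ")".
function trimUrlEnd(raw) {
  var url = raw;
  while (url.length > 0) {
    var last = url.charAt(url.length - 1);
    if ('.,;:!?'.indexOf(last) !== -1) {
      url = url.slice(0, -1); // sentence punctuation
    } else if (last === ')' && count(url, ')') > count(url, '(')) {
      url = url.slice(0, -1); // closing parenthesis without a partner
    } else {
      break;
    }
  }
  return { url: url, rest: raw.slice(url.length) };
}

console.log(trimUrlEnd('www.google.com/asdf).'));
console.log(trimUrlEnd('http://disney.wikia.com/wiki/Pua_(Moana)'));
```

Balanced parentheses inside the URL, as in the Wikia example from an earlier comment, are left alone.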
28

Based on Crescent Fresh's answer: if you want to detect links with or without http://, including links that start with just www., you can use the following:

function urlify(text) {
    var urlRegex = /(((https?:\/\/)|(www\.))[^\s]+)/g;
    //var urlRegex = /(https?:\/\/[^\s]+)/g;
    return text.replace(urlRegex, function(url,b,c) {
        var url2 = (c == 'www.') ?  'http://' +url : url;
        return '<a href="' +url2+ '" target="_blank">' + url + '</a>';
    }) 
}
Stephen Ostermiller
h0mayun
  • This is a good solution, but I also want to check that text should not already have href in it. I tried this regex = /((?!href)((https?:\/\/)|(www\.)|(mailto:))[^\s]+)/gi but it is not working. Can you help me with it or why the above regex is not working. – Sachin Jagtap May 02 '19 at 06:26
  • I like that you've also added target="_blank" to the returned output. This version is what I wanted. Nothing too over the top (otherwise I'd use Linkifyjs) just enough to get most links. – Michael Kubler Nov 28 '19 at 10:35
  • This will match invalid URLs like www.xyz – kehers Sep 17 '21 at 05:00
  • I tried all of the previous recommendations - this one worked perfectly when passing a string. – Rondakay Apr 14 '23 at 18:37
28

This library on NPM looks like it is pretty comprehensive https://www.npmjs.com/package/linkifyjs

Linkify is a small yet comprehensive JavaScript plugin for finding URLs in plain-text and converting them to HTML links. It works with all valid URLs and email addresses.

Dan Kantor
  • 8
    I just got done implementing linkifyjs in my project and it's fantastic. Linkifyjs should be the answer on this question. The other one to look at is https://github.com/twitter/twitter-text – Uber Schnoz Jun 01 '17 at 20:08
7

Function can be further improved to render images as well:

function renderHTML(text) { 
    var rawText = strip(text)
    var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;   

    return rawText.replace(urlRegex, function(url) {   

    if ( ( url.indexOf(".jpg") > 0 ) || ( url.indexOf(".png") > 0 ) || ( url.indexOf(".gif") > 0 ) ) {
            return '<img src="' + url + '">' + '<br/>'
        } else {
            return '<a href="' + url + '">' + url + '</a>' + '<br/>'
        }
    }) 
} 

or for a thumbnail image that links to the full-size image:

return '<a href="' + url + '"><img style="width: 100px; border: 0px; -moz-border-radius: 5px; border-radius: 5px;" src="' + url + '">' + '</a>' + '<br/>'

And here is the strip() function that pre-processes the text string for uniformity by removing any existing html.

function strip(html) 
    {  
        var tmp = document.createElement("DIV"); 
        tmp.innerHTML = html; 
        var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;   
        return tmp.innerText.replace(urlRegex, function(url) {     
        return '\n' + url 
    })
} 
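
Note that `url.indexOf(".jpg") > 0` is true wherever the extension appears, including in the host name or the middle of the path. A stricter check looks only at the end of the URL's path. The sketch below uses the standard `URL` constructor for that; the helper name `looksLikeImageUrl` is hypothetical and the extension list is the same assumption as in the answer.

```javascript
// Hypothetical helper: decide from the URL's path whether it points at an image.
function looksLikeImageUrl(url) {
  var pathname;
  try {
    // The URL constructor is available in browsers and in Node.js.
    pathname = new URL(url).pathname;
  } catch (e) {
    return false; // not an absolute URL that the parser accepts
  }
  // Only the end of the path counts; query string and host are ignored.
  return /\.(jpe?g|png|gif)$/i.test(pathname);
}

console.log(looksLikeImageUrl('http://example.com/cat.PNG?size=2'));
console.log(looksLikeImageUrl('http://my.gifts.example.com/page'));
```

Inside renderHTML() the condition could then become `if (looksLikeImageUrl(url))`.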
Gautam Sharma
6
let str = 'https://example.com is a great site'
str.replace(/(https?:\/\/[^\s]+)/g,"<a href='$1' target='_blank' >$1</a>")

Short Code Big Work!...

Result:-

 <a href='https://example.com' target='_blank' >https://example.com</a> is a great site
Kashan Haider
4

There is an existing npm package, url-regex. Install it with yarn add url-regex or npm install url-regex and use it as follows:

const urlRegex = require('url-regex');

const replaced = 'Find me at http://www.example.com and also at http://stackoverflow.com or at google.com'
  .replace(urlRegex({strict: false}), function(url) {
     return '<a href="' + url + '">' + url + '</a>';
  });
Vedmant
3

Detect URLs in text and make clickable.

const detectURLInText = ( contentElement ) => {
  const elem = document.querySelector(contentElement);
      elem.innerHTML = elem.innerHTML.replace(/(https?:\/\/[^\s]+)/g, `<a class='link' href="$1">$1</a>`)
  return elem
}

detectURLInText( '#myContent');
<div id="myContent">
  Hello world!, detect URLs in text and make clickable.
  IP: https://123.0.1.890:8080  
  Web: https://any-domain.com
</div>
GMKHussain
2

If you want to detect links with http:// OR without http:// OR ftp OR other possible cases like removing trailing punctuation at the end, take a look at this code.

https://jsfiddle.net/AndrewKang/xtfjn8g3/

A simple way to use it is to install it from npm:

npm install --save url-knife
Stephen Ostermiller
Kang Andrew
1

try this:

function isUrl(s) {
    if (!isUrl.rx_url) {
        // taken from https://gist.github.com/dperini/729294
        isUrl.rx_url=/^(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,}))\.?)(?::\d{2,5})?(?:[/?#]\S*)?$/i;
        // valid prefixes
        isUrl.prefixes=['http:\/\/', 'https:\/\/', 'ftp:\/\/', 'www.'];
        // taken from https://w3techs.com/technologies/overview/top_level_domain/all
        isUrl.domains=['com','ru','net','org','de','jp','uk','br','pl','in','it','fr','au','info','nl','ir','cn','es','cz','kr','ua','ca','eu','biz','za','gr','co','ro','se','tw','mx','vn','tr','ch','hu','at','be','dk','tv','me','ar','no','us','sk','xyz','fi','id','cl','by','nz','il','ie','pt','kz','io','my','lt','hk','cc','sg','edu','pk','su','bg','th','top','lv','hr','pe','club','rs','ae','az','si','ph','pro','ng','tk','ee','asia','mobi'];
    }

    if (!isUrl.rx_url.test(s)) return false;
    for (let i=0; i<isUrl.prefixes.length; i++) if (s.startsWith(isUrl.prefixes[i])) return true;
    for (let i=0; i<isUrl.domains.length; i++) if (s.endsWith('.'+isUrl.domains[i]) || s.includes('.'+isUrl.domains[i]+'\/') ||s.includes('.'+isUrl.domains[i]+'?')) return true;
    return false;
}

function isEmail(s) {
    if (!isEmail.rx_email) {
        // taken from http://stackoverflow.com/a/16016476/460084
        var sQtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';
        var sDtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]';
        var sAtom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+';
        var sQuotedPair = '\\x5c[\\x00-\\x7f]';
        var sDomainLiteral = '\\x5b(' + sDtext + '|' + sQuotedPair + ')*\\x5d';
        var sQuotedString = '\\x22(' + sQtext + '|' + sQuotedPair + ')*\\x22';
        var sDomain_ref = sAtom;
        var sSubDomain = '(' + sDomain_ref + '|' + sDomainLiteral + ')';
        var sWord = '(' + sAtom + '|' + sQuotedString + ')';
        var sDomain = sSubDomain + '(\\x2e' + sSubDomain + ')*';
        var sLocalPart = sWord + '(\\x2e' + sWord + ')*';
        var sAddrSpec = sLocalPart + '\\x40' + sDomain; // complete RFC822 email address spec
        var sValidEmail = '^' + sAddrSpec + '$'; // as whole string

        isEmail.rx_email = new RegExp(sValidEmail);
    }

    return isEmail.rx_email.test(s);
}

This will also recognize URLs such as google.com, http://www.google.bla, http://google.bla and www.google.bla, but not google.bla.

kofifus
1

You can use a regex like this to extract normal url patterns.

(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})

If you need more sophisticated patterns, use a library like this.

https://www.npmjs.com/package/pattern-dreamer

Kang Andrew
  • 1
    What's the purpose of `(?:www\.|(?!www))`? Why should `wwwww.com` be invalid? – Toto Jul 05 '19 at 09:31
  • You are right. Actually I just took it as many use the regex. I'd recommend using the linked library above. We should consider many cases in url detection, so the regex should be more complicated. – Kang Andrew Jul 08 '19 at 05:56
1

Generic Object Oriented Solution

For people like me that use frameworks like angular that don't allow manipulating DOM directly, I created a function that takes a string and returns an array of url/plainText objects that can be used to create any UI representation that you want.

URL regex

For URL matching I used (slightly adapted) h0mayun regex: /(?:(?:https?:\/\/)|(?:www\.))[^\s]+/g

My function also drops punctuation characters from the end of a URL like . and , that I believe more often will be actual punctuation than a legit URL ending (but it could be! This is not rigorous science as other answers explain well) For that I apply the following regex onto matched URLs /^(.+?)([.,?!'"]*)$/.
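
To see how the clean-up regex behaves on its own, here is a small worked example. The lazy `(.+?)` takes as little as possible, so the final group collects every punctuation character at the very end of the string. The wrapper `splitTrailingPunctuation` is a hypothetical name used only for this illustration.

```javascript
// The clean-up regex quoted above: a lazy body, then optional trailing punctuation.
var cleanUrlRegex = /^(.+?)([.,?!'"]*)$/;

function splitTrailingPunctuation(dirtyUrl) {
  var m = cleanUrlRegex.exec(dirtyUrl);
  // m is null for an empty string, which the URL matcher never produces.
  return m ? { url: m[1], punctuation: m[2] } : { url: dirtyUrl, punctuation: '' };
}

console.log(splitTrailingPunctuation('http://example.com/a?b=1.'));
console.log(splitTrailingPunctuation('www.example.com?!'));
// Caveat: a URL that legitimately ends in "?" loses that character too.
console.log(splitTrailingPunctuation('http://example.com/search?'));
```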

Typescript code

    export function urlMatcherInText(inputString: string): UrlMatcherResult[] {
        if (! inputString) return [];

        const results: UrlMatcherResult[] = [];

        function addText(text: string) {
            if (! text) return;

            const result = new UrlMatcherResult();
            result.type = 'text';
            result.value = text;
            results.push(result);
        }

        function addUrl(url: string) {
            if (! url) return;

            const result = new UrlMatcherResult();
            result.type = 'url';
            result.value = url;
            results.push(result);
        }

        const findUrlRegex = /(?:(?:https?:\/\/)|(?:www\.))[^\s]+/g;
        const cleanUrlRegex = /^(.+?)([.,?!'"]*)$/;

        let match: RegExpExecArray;
        let indexOfStartOfString = 0;

        do {
            match = findUrlRegex.exec(inputString);

            if (match) {
                const text = inputString.substr(indexOfStartOfString, match.index - indexOfStartOfString);
                addText(text);

                var dirtyUrl = match[0];
                var urlDirtyMatch = cleanUrlRegex.exec(dirtyUrl);
                addUrl(urlDirtyMatch[1]);
                addText(urlDirtyMatch[2]);

                indexOfStartOfString = match.index + dirtyUrl.length;
            }
        }
        while (match);

        const remainingText = inputString.substr(indexOfStartOfString, inputString.length - indexOfStartOfString);
        addText(remainingText);

        return results;
    }

    export class UrlMatcherResult {
        public type: 'url' | 'text'
        public value: string
    }
eddyP23
1

Here is a small solution for a React app that does not use any library. Please note that this method only works if the URL is separated from the surrounding text by spaces.

This component returns a paragraph with link detection.

import React from "react";


interface Props {
    paragraph: string,
}

const REGEX = /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/gm;

const Paragraph: React.FC<Props> = ({ paragraph }) => {
  
    const paragraphArray = paragraph.split(' ');
    return <div>

        {
            paragraphArray.map((word: any) => {
                return word.match(REGEX) ? (
                    <>
                        <a href={word} className="text-blue-400">{word}</a> {' '}
                    </>
                ) : word + ' '
            })
        }
    </div>;
};

export default Paragraph;



1

The other answers have a problem for anyone who wants to test text for URLs while it is being typed, inside an event handler (in a messaging application, for example).

Example:

The regexes presented in the other answers would already match https:// on its own, or just https://jeankassio, while the URL is still being typed.

As this was my case, and I couldn't find satisfactory answers, I decided to create my Regex with my average knowledge on the subject, and I arrived at the following result.

/(http|https):\/\/([^.]+[\.][\S]+)/

Explaining the Regex:

It matches:

  • first the http or https scheme;
  • then the characters before the first dot;
  • then the dot itself;
  • and then at least one more non-whitespace character after the dot. Until that character has been typed, nothing is matched.

This way, it makes it easier for programmers who want to use this Regex in real-time events.

OR ->

/(http|https):\/\/([^.]+[\.][\S]+(\s))/

This Regex will capture only after inserting a space after the link, which might be better for real time events
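
The difference between the two patterns is easiest to see by feeding them successive snapshots of what a user might type. The snapshots below are made up for illustration (the ".dev" ending in particular is an assumption), and the variable names are hypothetical.

```javascript
var whileTyping = /(http|https):\/\/([^.]+[\.][\S]+)/;
var afterSpace = /(http|https):\/\/([^.]+[\.][\S]+(\s))/;

// Hypothetical snapshots of an input field as the user types.
var steps = [
  'https://',
  'https://jeankassio',
  'https://jeankassio.',
  'https://jeankassio.d',
  'https://jeankassio.dev '
];

var results = steps.map(function (s) {
  return {
    input: s,
    whileTyping: whileTyping.test(s), // true once a character follows the dot
    afterSpace: afterSpace.test(s)    // true only once whitespace follows the link
  };
});

console.log(results);
```

Neither regex uses the g flag, so calling test() repeatedly is safe here.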

Stephen Ostermiller
0

tmp.innerText can be undefined in some browsers. You should use tmp.innerHTML

function strip(html) 
    {  
        var tmp = document.createElement("DIV"); 
        tmp.innerHTML = html; 
        var urlRegex =/(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/ig;   
        return tmp.innerHTML.replace(urlRegex, function(url) {     
        return '\n' + url 
    })