Find Any URL in text string exactly like Twitter Uses

Question

There are many similar questions, but none of them handle a URL that lacks www., http://, etc. I want to check whether a string contains a URL of ANY form. Twitter does this when you submit a Tweet.

Acceptable URLs would include, but not be limited to:

Two regular expressions I've tried, from Daring Fireball and from this question:

var regex = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\"\\.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))/i;

var regex = /(?:<\w+.*?>|[^=!:'"\/]|^)((?:https?:\/\/|www\.)[-\w]+(?:\.[-\w]+)*(?::\d+)?(?:\/(?:(?:[~\w\+%-]|(?:[,.;@:][^\s$]))+)?)*(?:\?[\w\+%&=.;:-]+)?(?:\#[\w\-\.]*)?)(?:\p{P}|\s|<|$)/;

Here is an example of the testing I'm doing: http://jsfiddle.net/3Wn26/5/

What problems have you encountered with the two examples you've tried? — chrisfrancis27, Jun 18 '12 at 21:21
@ChrisFrancis I've updated the question with an example: http://jsfiddle.net/3Wn26/5/ — stwhite, Jun 18 '12 at 21:26
@stwhite I answered a similar question here: http://stackoverflow.com/questions/10505456/regex-to-convert-url-to-links/10505843#10505843. In summary, if you want to remove prefix constraints like "www" then most likely you'll have to add suffix constraints like ``(com|org|co.uk|co.jp)$``; otherwise all sorts of nonsense may pass as "links." Consider it this way: suppose in a time and land far, far away, anyone with enough money can buy any TLD like "google.app" and "amazon.music" (...oh wait). If you, a human, can't tell whether "lol.cats" is a domain or a typo, then neither can a computer! — Andrew Cheong, Jun 18 '12 at 21:59
@acheong87 I tried your regex but it returns a few invalid results: http://jsfiddle.net/3Wn26/7/ Also, "3.141593" and "omg...really" shouldn't be considered URLs. I don't believe a URL can have either a numeric TLD or consecutive dots ("..."). — stwhite, Jun 18 '12 at 22:10
@stwhite - Ah, the regex was by someone else--I was merely guiding them as to how to modify it. What I meant to point out though, is exactly what you're saying--that nonsense strings shouldn't be URLs--so okay, you can tell the numbers and consecutive dots are wrong, but what about omg.lol? Without a list of TLD suffixes, you can't really know... — Andrew Cheong, Jun 19 '12 at 01:43
I think it would have helped if I had given some context on where this input comes from. In this case it's a search input: the user will be entering URLs to search for. If they type "omg.lol" it just won't have search results for it. Really do appreciate your suggestions though! — stwhite, Jun 19 '12 at 20:50

score 4 · Accepted Answer · ohaal · 2013-06-16T11:57:56.680

I don't think there's a good way to do this reliably (over time). Now that the new gTLDs are coming, it's going to be hard to keep up. Anyway, I gave it a shot.

/
  (
    \b
      (?:(https?|ftp):\/\/)?
      (
        (?:www\d{0,3}\.)?
        (
          [a-z0-9.-]+\.
          (?:[a-z]{2,4}|museum|travel)
          (?:\/[^\/\s]+)*
        )
      )
    \b
  )
/ix

Capture groups

1. The entire URL, ex: http://www.google.com/anyquerystringSAY/Rfy/srA/yh
2. The protocol, ex: http
3. URL including www., ex: www.google.com/swrua8rua8rUWRWAURHAJSrjuhFAhjT/Rtgfsbdh
4. URL excluding www., ex: google.com/sarwar8wa8r/R/A(R8 or images.google.com/w9r89w9ar8a9sjfriJRIUS(RY/(YUr
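
The pattern above is written in free-spacing mode (the x flag), which JavaScript regular expressions do not support. Since the question tests in JavaScript, here is a minimal sketch of the same pattern collapsed onto one line. The helper name findUrl and the property names it returns are made up for illustration; only the pattern itself comes from the answer.

```javascript
// The answer's pattern with whitespace removed, since JavaScript has no x flag.
const urlPattern = /(\b(?:(https?|ftp):\/\/)?((?:www\d{0,3}\.)?([a-z0-9.-]+\.(?:[a-z]{2,4}|museum|travel)(?:\/[^\/\s]+)*))\b)/i;

// Hypothetical helper: returns the four capture groups under readable names,
// or null when the text contains nothing the pattern accepts.
function findUrl(text) {
  const match = urlPattern.exec(text);
  if (match === null) {
    return null;
  }
  return {
    full: match[1],       // group 1: the entire URL
    protocol: match[2],   // group 2: undefined when no protocol was typed
    withWww: match[3],    // group 3: URL including "www."
    withoutWww: match[4]  // group 4: URL excluding "www."
  };
}

console.log(findUrl("Visit http://www.google.com/a/b now"));
console.log(findUrl("3.141593")); // no letters for the TLD part, so no match
```

A group that did not take part in the match (here the protocol, for input such as google.com) comes back as undefined rather than an empty string, so check for that before stripping it.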

Optionally, you can replace the (?:[a-z]{2,4}|museum|travel) bit with all the ones listed here, but that list is never going to stop growing, so I doubt it's worth it. (You can see I added the two exceptions museum and travel.)
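
If you do want an explicit list, one way is to build the pattern from an array instead of editing the regex by hand. This is only a sketch: knownTlds is a deliberately short, hypothetical sample (not the real registry list), and buildUrlRegex is a name invented here. It assumes every entry consists of letters, digits, or hyphens, so nothing needs escaping.

```javascript
// Hypothetical sample; a real list would be loaded from the TLD registry.
const knownTlds = ["com", "org", "net", "am", "museum", "travel"];

// Builds the URL pattern with the TLD part replaced by an explicit alternation.
// Assumes each TLD contains only letters, digits, or hyphens (no regex metacharacters).
function buildUrlRegex(tlds) {
  const alternation = tlds.join("|");
  return new RegExp(
    "(\\b(?:(https?|ftp):\\/\\/)?((?:www\\d{0,3}\\.)?([a-z0-9.-]+\\.(?:" +
      alternation +
      ")(?:\\/[^\\/\\s]+)*))\\b)",
    "i"
  );
}

const strictUrl = buildUrlRegex(knownTlds);
console.log(strictUrl.test("example.museum")); // "museum" is in the list
console.log(strictUrl.test("lol.cats"));       // "cats" is not in the list
```

The trade-off is the one already mentioned: anything missing from the list is silently rejected, so the list has to be kept up to date.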

Also notice I added ftp, feel free to remove that if you don't need it.

Hope this helps.

See it in action

edited Jun 16 '13 at 11:57

answered Jun 18 '12 at 22:05

ohaal


nice! So far no problems with your regex. Just want to make sure to test for any failures. Do you know any break points that this regex will have besides what you've mentioned? – stwhite Jun 18 '12 at 22:12
Well, it can misinterpret TLDs if you don't add all the TLDs manually, but that's the only one I can think of, other than that I'm not sure. – ohaal Jun 18 '12 at 22:15
I'm OK with the URLs inside of brackets. This text will be in a search input, and no parameters of the search accept brackets anyhow. I am curious about the illegal characters, but I don't believe I need to handle them, because I'll be using this regex to determine if the string is a URL, then processing only the base domain (http and www will be stripped from the string). – stwhite Jun 18 '12 at 22:19
I see, didn't realize you wanted the protocol stripped. It can be important. Anyway, I'll update the capture groups to capture everything in the first group, protocol in the second, URL with `www.` included in the third, and the URL in the fourth (ie. `images.google.com` and `google.com/q=234`) – ohaal Jun 18 '12 at 22:23
I don't want to strip them for the regex check. I'll be removing them after I determine if the string was a URL or not. Your Regex seems to work great so far. – stwhite Jun 18 '12 at 22:25
@stwhite: There are now 4 capture groups, [see it on rubular.](http://rubular.com/r/dwbmNqDF2g) – ohaal Jun 18 '12 at 22:29

score 1 · Answer 2 · answered Jun 18 '12 at 21:50

(# Scheme
 [a-z][a-z0-9+\-.]*:
 (# Authority & path
  //
  ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
  ([a-z0-9\-._~%]+                            # Named host
  |\[[a-f0-9:.]+\]                            # IPv6 host
  |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
  (:[0-9]+)?                                  # Port
  (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
 |# Path without authority
  (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
 )
|# Relative URL (no scheme or authority)
 ([a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?  # Relative path
 |(/[a-z0-9\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?

RFC 3986. Validate if a string holds a URL as specified in RFC 3986. Both absolute and relative URLs are supported.
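
Note that supporting relative URLs means almost any bare word is accepted, because a word with no scheme is a valid relative reference. A rough way to see the distinction in JavaScript is the built-in URL constructor. This is a sketch for illustration only: URL follows the WHATWG URL Standard rather than RFC 3986 exactly, and the helper names and the base http://example.com/dir/ are made up here.

```javascript
// True only if `text` parses as an absolute URL on its own (it must have a scheme).
function isAbsoluteUrl(text) {
  try {
    new URL(text);
    return true;
  } catch (err) {
    return false; // the URL constructor throws a TypeError on invalid input
  }
}

// Resolves `text` as a relative reference against a hypothetical base URL.
function resolveRelative(text, base = "http://example.com/dir/") {
  return new URL(text, base).href;
}

console.log(isAbsoluteUrl("google"));            // false
console.log(isAbsoluteUrl("http://google.com")); // true
console.log(resolveRelative("google"));          // "http://example.com/dir/google"
```

So a validator that accepts relative references cannot, by design, reject a plain word.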

Was this tested on the example I gave? Because I just ran this on the example provided in the question and it accepts all of those URLs—it should not. => http://jsfiddle.net/3Wn26/6/ — stwhite, Jun 18 '12 at 21:57
Yes, they match the 5 given URLs as specified in the question. Which URLs should not match? I don't see any specified. — buckley, Jun 18 '12 at 22:00
I posted the JSFiddle snippet in the previous comment but here it is again: http://jsfiddle.net/3Wn26/6. These strings are not URLs and should not match: "google", "google", "comgoogle", "googlecom". — stwhite, Jun 18 '12 at 22:06

score 0 · Answer 3 · answered Jun 19 '12 at 07:58

The answer is - you can't.

Twitter, for example, treats the name of the singer Will.I.Am as a URL (.am is a valid TLD).

Without knowing all the domain registration rules at every tld, there's no way of knowing if a URL is valid without testing.

Here is what I propose you do.

1. Be generous with your script. Accept almost any string with a "." in it.
2. Perform an HTTP HEAD request to see whether the URL exists.
3. Do a WHOIS lookup to see whether the domain has been registered (even if the exact URL doesn't match).

Of course, this doesn't take into account that someone may have posted a link to their intranet, which would work for some of their followers.
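
The first two steps could look roughly like the sketch below. It assumes an environment with a global fetch (Node 18+ or server-side code; in a browser, cross-origin HEAD requests are usually blocked). The function names are invented for illustration, and the WHOIS step is left out because JavaScript has no built-in API for it.

```javascript
// Step 1: be generous. Accept any whitespace-free token with a dot inside it.
function looksLikeUrlCandidate(text) {
  return /^\S+\.\S+$/.test(text.trim());
}

// Step 2: probe the candidate with an HTTP HEAD request.
// fetchImpl is a parameter so the network call can be stubbed out;
// by default it is the global fetch.
async function urlResponds(candidate, fetchImpl = globalThis.fetch) {
  const trimmed = candidate.trim();
  if (!looksLikeUrlCandidate(trimmed)) {
    return false;
  }
  // Assume http:// when the user typed no scheme.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed);
  const url = hasScheme ? trimmed : "http://" + trimmed;
  try {
    const response = await fetchImpl(url, { method: "HEAD" });
    return response.ok; // true for 2xx responses
  } catch (err) {
    return false; // DNS failure, refused connection, invalid URL, and so on
  }
}

console.log(looksLikeUrlCandidate("will.i.am")); // true: generous on purpose
console.log(looksLikeUrlCandidate("google"));    // false: no dot
```

Some servers reject HEAD requests even for pages that exist, so a false result here means "could not confirm", not "definitely not a URL".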

score 0 · Answer 4 · answered Nov 24 '12 at 00:38

My simple JavaScript library, FuncJS, has a function called findLinks() that should do what you want.

If you have a string with links inside it, simply pass it to the function, like this:

findLinks("Visit my website at http://website.com and visit my profile on Twitter at http://twitter.com/yourProfile!");

Then output the result using any method you like, such as document.write, and the string should appear with its links highlighted.

For a greater understanding of this function, please read the documentation at http://docs.funcjs.webege.com/findLinks().html.

Hope this helps you out and anyone else wanting to do this! :)
