Regex to match URL

Question

I am using the following regex to match a URL:

$search  = "/([\S]+\.(MUSEUM|TRAVEL|AERO|ARPA|ASIA|COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|AR|AS|AT|AU|au|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BL|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|EH|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MF|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TF|TG|TH|TJ|TK|TL|TM|TN|TO|R|H|TP|TR|TT|TV|TW|TZ|UA|UG|UK|UM|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|YU|ZA|ZM|ZW)([\S]*))/i";

However, it is flawed: it also matches "abc.php", which I don't want, as well as something like abc...test. I do want it to match abc.com, www.abc.com, and http://abc.com.

It just needs a slight tweak at the end, but I am not sure what. (There should be a slash after the domain name, which it is not checking for right now; it only checks \S.)

Thank you for your time.

Boldewyn · Accepted Answer · 2009-07-17T10:37:05.543

21

$search  = "#^((?#
    the scheme:
  )(?:https?://)(?#
    second level domains and beyond:
  )(?:[\S]+\.)+((?#
    top level domains:
  )MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
  )COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
  )A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
  )C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
  )E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
  )H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
  )K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
  )N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
  )S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
  )U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
    the path, can be there or not:
  )(/[a-z0-9\._/~%\-\+&\#\?!=\(\)@]*)?)$#i";

I just cleaned it up a bit. This will match only HTTP(S) addresses with the http:// or https:// scheme declared and, as long as you copied all top-level domains correctly from IANA, only standardized ones (it will not match http://localhost).

Finally, the pattern ends with the path part, which, if present, always starts with a /.

However, I'd suggest following Cerebrus's advice: if you're not sure about this, learn regexps in a gentler way and use proven patterns for complicated tasks.

Cheers,

By the way: Your regexp will also match something.r and something.h (between |TO| and |TR| in your example). I left them out in my version, as I guess it was a typo.

On re-reading the question: Change

  )(?:https?://)(?#

to

  )(?:https?://)?(?#

(there is an extra ?) to match 'URLs' without the scheme.
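
The structure of the pattern above (optional scheme, dotted labels, a TLD from a list, optional path starting with a slash) can be sketched in Python as follows. This is only an illustration under assumptions: the TLD list is shortened to a hypothetical handful, the labels use a stricter character class than the answer's [\S]+, and the names TLDS, URL_RE and looks_like_url are made up for this example.

```python
import re

# Hypothetical, shortened TLD list; the answer's full list is much longer.
TLDS = "com|net|org|info|de|uk"

URL_RE = re.compile(
    rf"""
    ^
    (?:https?://)?                    # optional scheme
    (?:[a-z0-9-]+\.)+                 # labels, each followed by a dot (assumption: stricter than the answer)
    (?:{TLDS})                        # top-level domain taken from the list
    (?:/[a-z0-9._/~%\-+&#?!=()@]*)?   # optional path, must start with a slash
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)


def looks_like_url(text: str) -> bool:
    """Return True if the whole string matches the sketch pattern."""
    return URL_RE.match(text) is not None
```

With this sketch, abc.com, www.abc.com and http://abc.com are accepted, while abc.php and abc...test are rejected because "php" and "test" are not in the list.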

edited Jul 17 '09 at 10:37

answered Jul 17 '09 at 08:07

Boldewyn


But I don't want the http:// at the beginning to be compulsory, as I want it to match "abc.com" as well. – Alec Smart Jul 17 '09 at 08:11
Can you please tighten [\S]* to allow only the characters that are permitted in a URL (no spaces, just words, numbers, and so on)? – Alec Smart Jul 17 '09 at 10:20
\S should never match spaces... I updated it to what Wikipedia http://en.wikipedia.org/wiki/How_to_edit#Links_and_URLs allows in its URLs. That looks reasonable. – Boldewyn Jul 17 '09 at 10:40
:-) Yeah, and unfortunately all the other generic TLDs. This will make automated link detection without natural language processing near impossible... – Boldewyn Sep 09 '11 at 08:41
Yeah I doubt this will catch all the open domain tlds like big.wong – Eddie Apr 11 '16 at 22:37
@Eddie for your consideration: the answer is from '09, the comment from '11. In natural language you’d want to detect IRIs as well, which will come dangerously close to the useless `/\w+\.\w+/u`. What will we do then? Last resort is natural language processing and trying to parse the text to get a grasp of the meaning. – Boldewyn Apr 12 '16 at 07:55
@Boldewyn Tell me, please, by which process/tooling did you construct the "MUSEUM"…"Z[AMW]" portion? I presume it was not by hand? It's quite impressive. – James Cropcho Nov 08 '22 at 21:23
It would be nice for the regexp to also enforce maximum length(s) for the domain name, vis-à-vis https://webmasters.stackexchange.com/q/16996 – James Cropcho Nov 08 '22 at 21:46
When I compiled the list in 2009 there were no new-style generic TLDs like .google or .books, so, yes, I compiled it by hand. This means, too, that the regexp is now outdated, since it won’t match any of the newer TLDs. I’d suggest taking a pragmatic `\.[a-z]+` approach instead of trying to stay on top of newly defined TLDs. About the max-length: replace `+` with `{1,63}`. That does the `label` part. The 253 char total part needs to be done separately, though. – Boldewyn Nov 08 '22 at 22:50
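
The length limits mentioned in the last comment ({1,63} per label, 253 characters in total, checked separately) could be sketched like this. The label character class and the names LABEL_RE and hostname_within_limits are assumptions made for the example, not part of the answer.

```python
import re

# Each dot-separated label: 1 to 63 characters (character class is an assumption).
LABEL_RE = re.compile(r"[a-z0-9-]{1,63}", re.IGNORECASE)


def hostname_within_limits(host: str) -> bool:
    """Check the 253-character total and the 63-character per-label limits."""
    if len(host) > 253:
        return False  # the total length is checked outside the regex
    labels = host.split(".")
    return len(labels) >= 2 and all(LABEL_RE.fullmatch(label) for label in labels)
```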

score 12 · Answer 2 · answered Aug 18 '10 at 02:23

Not exactly what the OP asked for but this is a much simpler regular expression that does not need to be updated each time the IANA introduces a new TLD. I believe this is more adequate for most simple needs:

^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$

No TLD list is needed, localhost is not matched, there must be at least 2 subparts, and each subpart after the first must be at least 2 characters long (e.g. "a.a" will not match but "a.ab" will).
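
A quick way to try these rules is to run the pattern through Python's re module. This is just a sketch; the names SIMPLE and is_simple_host are made up for the example.

```python
import re

# The pattern from this answer, unchanged.
SIMPLE = re.compile(r"^(?:https?://)?(?:[\w]+\.)(?:\.?[\w]{2,})+$")


def is_simple_host(text: str) -> bool:
    """Return True if the whole string matches the simple pattern."""
    return SIMPLE.match(text) is not None
```

Under this pattern, "a.ab" and "http://example.com" match, while "a.a", "localhost", "example.com/path" and "my-site.com" do not, since there is no path part and \w does not include hyphens.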

answered Aug 18 '10 at 02:23

Diego Perini


So this does not match the path & query param part of url? – lulalala May 31 '13 at 03:43
Also fails to match hyphens in the URL. – Styphon Oct 28 '14 at 16:49
You need to escape slashes in `https?://` but still it's too broad. You can test it here: http://www.regexr.com/ – ahmd0 Feb 09 '15 at 19:31
It doesn't seem to match subdomains e.g. https://consent.cookiebot.com – VilladsR Jun 14 '23 at 08:43

score 8 · Answer 3 · answered Jul 18 '13 at 22:28

This question was surprisingly difficult to find an answer for. The regexes I found were too complicated to understand, and anything more than a regex is overkill and too difficult to implement.

Finally came up with:

/(\S+\.(com|net|org|edu|gov)(\/\S+)?)/

Works with http://example.com, https://example.com, example.com, http://example.com/foo.

Explanation:

Looks for .com, etc.
Matches everything before it up to the space
Matches everything after it up to the space
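
The three steps above can be tried out with a small Python sketch. The second pattern adds a word boundary (\b) after the TLD, a variant suggested in the comments on this answer; the names LOOSE, BOUNDED and find_url are hypothetical.

```python
import re

# Pattern from the answer, and the variant with \b after the TLD.
LOOSE = re.compile(r"(\S+\.(com|net|org|edu|gov)(/\S+)?)")
BOUNDED = re.compile(r"(\S+\.(com|net|org|edu|gov)\b(/\S+)?)")


def find_url(text, pattern=LOOSE):
    """Return the first URL-like substring found in text, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None
```

For example, find_url("see http://example.com/foo now") returns "http://example.com/foo". The loose pattern also finds "example.com" inside "example.commerce", which the bounded variant rejects.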

answered Jul 18 '13 at 22:28

B Seven

That will also match if a string like ".com" occurs but is not part of the domain, like in "http://example.zork/foo/.com/bar", and omits all the country-specific top-level domains (like .uk, .ca, etc) and others. – TextGeek Jan 16 '20 at 21:24
+1 for letting you choose which domains you'd like to accept, although I would make sure to point that out. Adding a word boundary (\b) after the domains prevents hits that match the domain but keep extending, like example.commerce or example.governance: /(\S+\.(com|net|org|edu|gov)\b(\/\S+)?)/ – Luigi Jan 06 '23 at 15:58

score 6 · Answer 4 · edited Jan 22 '18 at 00:52

This will get any url in its entirety, including ?= and #/ if they exist:

/[A-Za-z]+:\/\/[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_:%&;\?\#\/.=]+/g

edited Jan 22 '18 at 00:52

Sebastián Palma


answered Jul 29 '15 at 15:01

Miko Trueman


Also matches `hap://foo.com/` :) – stelios Aug 11 '18 at 10:02
This omits a few permitted characters, such as apostrophe, !, double-quote, and plus; and % should only be allowed if followed by 2 hex digits. Not to mention internationalized URIs (IRIs). – TextGeek Jan 16 '20 at 21:13

score 1 · Answer 5 · answered Jul 17 '09 at 07:39

Using a single regexp to match a URL string makes the code incredibly unreadable. I'd suggest using parse_url to split the URL into its components (which is not a trivial task) and checking each part with a regexp.
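
The same split-then-check idea can be sketched in Python, using urllib.parse.urlsplit as a rough analogue of PHP's parse_url. The host pattern and the names HOST_RE and is_http_url are assumptions made for the example, not a complete validator.

```python
import re
from urllib.parse import urlsplit

# Assumed host rule: dotted labels followed by an alphabetic TLD of 2+ letters.
HOST_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z]{2,}", re.IGNORECASE)


def is_http_url(text: str) -> bool:
    """Split the URL first, then check each component on its own."""
    parts = urlsplit(text)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    return HOST_RE.fullmatch(host) is not None
```

Each component gets its own small, readable check. Note that this sketch requires a scheme, so a bare "abc.com" is rejected.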

answered Jul 17 '09 at 07:39

Bluehorn


score 1 · Answer 6 · edited Aug 22 '11 at 12:15

Changing the end of the regex to (/\S*)?)$ should solve your problem.

To explain what that is doing:

it looks for a / followed by any number of non-whitespace characters
this match is optional; ? indicates 0 or 1 times
and finally it should be followed by the end of the string (or change $ to \b to match on a word boundary).
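
The effect of that ending can be checked with a short Python sketch. The TLD list is cut down to a hypothetical three entries, and the names TLDS, PATTERN and matches are made up for the example.

```python
import re

TLDS = "com|net|org"  # hypothetical short list, standing in for the full one

# Same shape as the question's pattern, ending in an optional path and $.
PATTERN = re.compile(rf"^\S+\.(?:{TLDS})(/\S*)?$", re.IGNORECASE)


def matches(text: str) -> bool:
    """Return True if the whole string matches the anchored pattern."""
    return PATTERN.match(text) is not None
```

With the anchored ending, abc.com and http://abc.com/path?x=1 match, while abc.php and abc.comx do not.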

edited Aug 22 '11 at 12:15

axel22


answered Jul 17 '09 at 07:45

benophobia


score 1 · Answer 7 · answered Nov 05 '13 at 00:52

I think this is simple and efficient /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/

answered Nov 05 '13 at 00:52

aminhotob


score 0 · Answer 8 · answered Jul 17 '09 at 07:51

$: The dollar sign signifies the end of the string.
For example, \d+$ will match strings that end with a digit. So you need to add the $!

answered Jul 17 '09 at 07:51

Matthieu


Jerry · Answer 9 · 2013-02-18T15:45:00.640

A regex to match all URLs (with or without www, with or without http or https), including all 2-6 letter top-level domain names (for countries, e.g. 'ly', 'us'), ports, query strings, and anchors ('#'). It's not 100% accurate, but it is better than anything I have seen posted on the web.

It uses the top-level domains from the first answer, combined with other techniques found in my searches. It will return any valid URL that has bounds; that is where \b comes into play. Since the trailing '/' is also triggered by \b, the last one, is a match for one or more '?'.

/\b((http(s?):\/\/)?([a-z0-9\-]+\.)+(MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(:[0-9]{1,5})?((\/([a-z0-9_\-\.~]*)*)?((\/)?\?[a-z0-9+_\-\.%=&]*)?)?(#[a-zA-Z0-9!$&'()*+.=-_~:@/?]*)?)/gi

What does "/?" mean near the end of the regex? Did you mean "\/?" — , Apr 01 '13 at 18:15
Doesn't appear to work for things like "http://s3.amazonaws.com/plivocloud/4c743546-7e1b-11e2-9060-002590662312.mp3" — , Apr 01 '13 at 18:23

score 0 · Answer 10 · answered May 31 '13 at 15:02

This is THE ONE:

_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS

score 0 · Answer 11 · answered Mar 14 '15 at 06:05

Try Regexy::Web::Url

r = Regexy::Web::Url.new # matches 'http://foo.com', 'www.foo.com' and 'foo.com'

answered Mar 14 '15 at 06:05

pragma


score -1 · Answer 12 · edited Apr 25 '17 at 14:17

[ftp:\/\/www\/.-https:\/\/-http:\/\/][a-zA-Z0-9u00a1-uffff0]{1,3}[^ ]{1,1000}

This works fine for me in js

var regex = new RegExp('[ftp:\/\/www\/.-https:\/\/-http:\/\/][a-zA-Z0-9u00a1-uffff0]{1,3}[^ ]{1,1000}');
regex.exec('https://www.youtube.com/watch?v=FM7MFYoylVs&feature=youtu.be&t=20s');

edited Apr 25 '17 at 14:17

Christoph


answered Apr 25 '17 at 09:42

keshav gaur

can you format your answer better? It's very difficult to understand. – Felix Haeberle Apr 25 '17 at 10:35

score -2 · Answer 13 · answered Jul 25 '12 at 19:21

Just to add to things: I know this doesn't fully and directly answer this specific question, but it's the best place I can find to add this info. I wrote a jQuery plugin a while back to match URLs for a similar purpose. In its current state (I will update it as time goes on) it will still consider addresses like 'http://abc.php' valid. However, if there is no http, https, or ftp at the start of the URL, it will not return 'valid'. I should clarify that this jQuery method returns an object and not just a string or boolean. The object breaks the URL down into its parts, and among them is a .valid boolean. See the full fiddle and test in the link at the bottom. If you simply want to grab the plugin and go, see below:

jQuery Plugin

(function($){$.matchUrl||$.extend({matchUrl:function(c){var b=void 0,d="url,,scheme,,authority,path,,query,,fragment".split(","),e=/^(([^\:\/\?\#]+)\:)?(\/\/([^\/\?\#]*))?([^\?\#]*)(\?([^\#]*))?(\#(.*))?/,a={url:void 0,scheme:void 0,authority:void 0,path:void 0,query:void 0,fragment:void 0,valid:!1};"string"===typeof c&&""!=c&&(b=c.match(e));if("object"===typeof b)for(x in b)d[x]&&""!=d[x]&&(a[d[x]]=b[x]);a.scheme&&a.authority&&(a.valid=!0);return a}});})(jQuery);

jsFiddle with example:

http://jsfiddle.net/SpYk3/e4Ank/

score -3 · Answer 14 · edited Aug 22 '11 at 12:14

(http|www)\S+

Just use this regex to match all URLs.

edited Aug 22 '11 at 12:14

axel22


answered Aug 22 '11 at 09:58

Nibalkar

This is a really bad regular expression. I can't believe people actually voted for it. It is bad because it will also match completely invalid `httpcheese` as a valid url. – Stefan Arentz Nov 15 '12 at 20:59
