36

I'm not very good at regular expressions at all.

I've relied on a lot of framework code to date, but I can't find a pattern that matches a URL like http://www.example.com/etcetc and also catches forms like www.example.com/etcetc and example.com/etcetc.

Peter Mortensen
Edmund Rojas
  • This question may help you. http://stackoverflow.com/questions/1141848/regex-to-match-url – Wiseguy Jun 21 '11 at 15:12
  • Possible duplicate of [url regex without http://www.](http://stackoverflow.com/questions/3310216/url-regex-without-http-www) – Balanivash Jun 21 '11 at 15:12
  • the first two options can be matched, but matching your last one `example.com/etcetc` is going to be virtually impossible. You'd need to basically just match anything with a dot in the middle. – Spudley Jun 21 '11 at 15:15
  • 1
    @Balanivash - a bit harsh to mark as a duplicate of a question that got closed. – Spudley Jun 21 '11 at 15:16
  • I was answering questions like this until yesterday, but today I was asked to mark them as duplicates if such a question already existed; that's why I did it. – Balanivash Jun 21 '11 at 15:18
  • A canonical question is *[How can I split a URL string up into separate parts in Python?](https://stackoverflow.com/questions/449775/)* (2009). – Peter Mortensen Nov 28 '22 at 02:34

13 Answers

54

For matching all kinds of URLs, the following code should work:

<?php
    // Single-quoted strings: PHP would otherwise try to interpolate "$_" inside the character classes
    $regex = '((https?|ftp)://)?'; // SCHEME
    $regex .= '([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?@)?'; // User and Pass
    $regex .= '([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))'; // Host or IP address
    $regex .= '(:[0-9]{2,5})?'; // Port
    $regex .= '(/([a-z0-9+$_%-]\.?)+)*/?'; // Path
    $regex .= '(\?[a-z+&\$_.-][a-z0-9;:@&%=+/$_.-]*)?'; // GET Query
    $regex .= '(#[a-z_.-][a-z0-9+$%_.-]*)?'; // Anchor
?>

Then, the correct way to check against the regex is as follows:

<?php
   if(preg_match("~^$regex$~i", 'www.example.com/etcetc', $m))
      var_dump($m);

   if(preg_match("~^$regex$~i", 'http://www.example.com/etcetc', $m))
      var_dump($m);
?>

Courtesy: Comments made by splattermania in the PHP manual: preg_match

RegEx Demo in regex101
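The comments on this answer point out that the `^` and `$` anchors get in the way when searching for URLs inside longer text with preg_match_all. A minimal sketch of that use, assuming the same pattern pieces as above; the helper name `extractUrls` is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: returns every URL-like substring found in $text.
function extractUrls(string $text): array
{
    // Single quotes keep PHP from interpolating "$_" inside the character classes.
    $regex  = '((https?|ftp)://)?';                                     // Scheme
    $regex .= '([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?@)?'; // User and password
    $regex .= '([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))'; // Host or IP
    $regex .= '(:[0-9]{2,5})?';                                         // Port
    $regex .= '(/([a-z0-9+$_%-]\.?)+)*/?';                              // Path
    $regex .= '(\?[a-z+&\$_.-][a-z0-9;:@&%=+/$_.-]*)?';                 // Query
    $regex .= '(#[a-z_.-][a-z0-9+$%_.-]*)?';                            // Fragment

    // No ^ or $ here, so a match may start anywhere in the text.
    preg_match_all("~$regex~i", $text, $matches);

    // $matches[0] holds the full matches; the other entries hold the capture groups.
    return $matches[0];
}

print_r(extractUrls('Visit www.example.com/etcetc or http://www.example.com/etcetc today'));
```

Without the anchors, the pattern will also pick up any dotted word that happens to look like a host name, so the results may need further filtering.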

Peter Mortensen
anubhava
  • I tried using this in a preg_match_all, and though it returned no errors, it didn't seem to catch any URLs – Edmund Rojas Jun 21 '11 at 15:32
  • Not sure what I'm doing wrong, but I can't get this to work like this: preg_match_all("/^$regex$/", $text, $matches, PREG_PATTERN_ORDER); – Edmund Rojas Jun 21 '11 at 15:54
  • @edmund: Even the call above works fine; you can see a code demo at: http://ideone.com/Xk2Ek (the last one is your function call). – anubhava Jun 21 '11 at 15:59
  • Interesting, I am able to get the code to run when it's not integrated into my code, so I'll have to fiddle with it and find where it's going wrong. Thanks again, appreciate the help! – Edmund Rojas Jun 21 '11 at 16:11
  • Actually, I tried ideone and posted the result here. Is there anything really obvious to you that I've done wrong? http://ideone.com/CxTAJ – Edmund Rojas Jun 21 '11 at 16:15
  • @Edmund: There was a minor issue in your code, since you were using `^` and `$` in matching URLs. Please check your modified working code: http://ideone.com/bBsvW – anubhava Jun 21 '11 at 16:34
  • 2
    +1 Comment inside a method is usually a sign of code smell. BUT, comment *in* regex or complex SQL queries is THE way to go. – Toto May 11 '12 at 17:17
  • 1
    @Toto I realize there's debate, for example http://programmers.stackexchange.com/questions/1/comments-are-a-code-smell, but I really can't ever get into the notion that comments are code smell in any case except where the comments don't match the code. – Patrick Oct 21 '12 at 05:52
  • 2
    Hi, I had to add A-Z next to every a-z because of YouTube-like links, but I think it is still excellent anyway – merveotesi Nov 22 '12 at 10:14
  • 4
    I liked the way you broke it down with comments. It's kinda like a regular expression buffet, where you can pick and choose what you want to put on your plate – Expedito Jan 05 '13 at 12:32
  • Would you explain why this URL doesn't match? **`"My site ULR is http://mywebsite.tn/%D8%A7%D9%84%D9%81%D8%%D8%A7%D9%84%D9%81%D8 is very nice .";`** How can your regex be edited to match this URL, please, anuba? – Scooter Daraf Aug 06 '16 at 10:58
  • 1
    If you say try that, I know for sure that it will work because you don't make mistakes :) . Thanks anuba, it works now; that's why I asked you :) . +1 – Scooter Daraf Aug 06 '16 at 11:42
  • @anubhava what about that link **`this.is.not.a.url.but.your.regex.will.think.so`** ? – Scooter Daraf Aug 06 '16 at 12:02
  • Or this sentence: **`I wrote this sentence.im like it a lot`**. Here, sentence.im will be detected as a URL. Is there any workaround? – Scooter Daraf Aug 06 '16 at 12:11
  • `this.is.not.a.url.but.your.regex.will.think.so` will be matched but it won't match `I wrote this sentence.im like it a lot`. Even `parse_url` PHP function will match first string. Regex cannot figure out what domains are valid and what are not. – anubhava Aug 06 '16 at 12:22
  • You're right, I have tested your regex here with some URLs: https://regex101.com/r/fG4hM6/1 – Scooter Daraf Aug 06 '16 at 13:22
  • Hi again , i tried to use diegpperinis regex like that **if(preg_match("_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS", $i)){** – Scooter Daraf Aug 06 '16 at 14:12
  • But I'm getting this error: **`Undefined variable: _iuS`** and **`Warning: preg_match(): No ending delimiter '_' found in`**. How can I fix them, as I'm not good at regex? – Scooter Daraf Aug 06 '16 at 14:13
  • diegoperini's regex seems to match all kinds of URLs – Scooter Daraf Aug 06 '16 at 14:14
  • Sorry anuba, I posted diegoperini's regex and the error, but I didn't share the website where it is shown. Please look here: **https://mathiasbynens.be/demo/url-regex**. How can I fix the errors when using it the way I pasted it? Thanks – Scooter Daraf Aug 06 '16 at 17:29
  • @ScooterDaraf: I suggest you open a new question at this point linking to this one. Solving all the new cases from comments section on an old question isn't the right way. – anubhava Aug 07 '16 at 05:26
  • 1
    this URL http://ideone.com/eP88F is now a "Solution not found" and should be removed. For future use, that website should not be used to embed links, as it is known for not storing code for long periods of time. – Funk Forty Niner Oct 22 '17 at 12:36
  • Thanks @Fred-ii-, I removed that outdated link from my answer. – anubhava Oct 22 '17 at 12:42
  • 1
    Very good, thanks for sharing it. I had personally improved it in this way: instead of matching gTLD/TLDs using: `([a-z]{2,4})` I decided to use a list of gTLD/TLDs and changed the Host/IP regex in this way: `(com|net|pizza|co.uk|whatever)(?![a-z])`. Doing that you can match `www.google.pizza` and dont match `www.google.pizzam`. NB: If you are looking for a list of TLD: https://github.com/umpirsky/tld-list/blob/master/data/en/tld.php Cheers. – Matteo Martinelli Nov 18 '17 at 14:59
  • Note that this regex also matches email addresses. – Davy de Vries Jul 07 '19 at 10:27
20

This worked for me in all cases I had tested:

$url_pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z0-9\&\.\/\?\:@\-_=#])*/';

Tests:

http://test.test-75.1474.stackoverflow.com/
https://www.stackoverflow.com
https://www.stackoverflow.com/
http://wwww.stackoverflow.com/
http://wwww.stackoverflow.com


http://test.test-75.1474.stackoverflow.com/
http://www.stackoverflow.com
http://www.stackoverflow.com/
stackoverflow.com/
stackoverflow.com

http://www.example.com/etcetc
www.example.com/etcetc
example.com/etcetc
user:pass@example.com/etcetc

example.com/etcetc?query=aasd
example.com/etcetc?query=aasd&dest=asds

http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www/

Every valid Internet URL contains at least one dot, so the pattern above simply looks for at least two strings joined by a dot, made up of characters that a URL may contain.

Peter Mortensen
H Aßdøµ
  • 2
    Simplified this regex a bit: ``/^[a-z0-9./?:@\-_=#]+\.([a-z0-9./?:@\-_=#])*$/i``. Meta chars don't need to be escaped within square brackets. I stripped the optional part in front, which isn't required for validating the URL (I don't need the captured values in my use case), and simplified the pattern with a case-insensitive modifier instead of repeating everything within the character classes. – staabm Jul 28 '14 at 07:58
  • another glitch: the above regex does not work for urls containing parameters (and therefore an &). also encoded params are not supported - % sign. – staabm Jul 28 '14 at 16:08
  • 1
    /(http|https)\:\/\/+[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z0-9\&\.\/\?\:@\-_=#])*/ Please use + instead of ? after (http|https)\:\/\/, because ? also lets http:/ through, so http:/yahoo.com is accepted even though it is not valid. Adding the + sign fixes it. – Roop Kumar Mar 23 '17 at 07:49
  • 1
    From the original pattern, I only replaced the last `*` with a `+` so that strings like `word.` don't match the expression. Only strings like `word.com` should match. – Roger May 23 '17 at 07:47
  • Finally, I found it better to replace the last `*` with `{2,}`. – Roger May 23 '17 at 07:55
  • The string `2020-08-06T16:26:23.561Z` also gets past this pattern. – Awais Ayub Aug 15 '20 at 01:13
5

Try this:

/^(https?:\/\/)?(www\.)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/

It works the way people usually want.

It accepts URLs with or without http://, https://, and www.

5

You can put a question mark after part of a regular expression to make that part optional, so you would want to use:

http:\/\/(www\.)?

That will match anything that has either http://www. or http:// (with no www.)

You could just use a replace method to remove the above, thus getting you the domain. It depends on what you need the domain for.
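The replace idea can be sketched with preg_replace. This is a minimal sketch, not the answer author's code: the helper name `stripUrlPrefix` is hypothetical, and making the scheme itself optional and accepting https are assumptions added so that all three forms from the question reduce to the same string.

```php
<?php
// Hypothetical helper: removes a leading scheme and "www." if either is present.
function stripUrlPrefix(string $url): string
{
    // ^ anchors at the start; (https?://)? and (www\.)? are each optional.
    // "~" is used as the delimiter so the slashes need no escaping.
    return preg_replace('~^(https?://)?(www\.)?~i', '', $url) ?? $url;
}

echo stripUrlPrefix('http://www.example.com/etcetc'), "\n";
echo stripUrlPrefix('www.example.com/etcetc'), "\n";
echo stripUrlPrefix('example.com/etcetc'), "\n";
```

Each call should leave only the domain and path, which can then be compared or stored in one canonical form.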

Peter Mortensen
Michael Wright
3

Use:

/(https?:\/\/)?((?:(\w+-)*\w+)\.)+(?:[a-z]{2})(\/?\w?-?=?_?\??&?)+[\.]?([a-z0-9\?=&_\-%#])?/g

It matches something.com with or without a leading http(s):// or www. It does not match URLs with other [something]:// schemes, but for my purposes that's not necessary.

The regex matches e.g.:

http://foo.co.uk/
www.regex.com/foo.html?q=bar$some=thi-ng,regex
regex.foo.com/blog
Peter Mortensen
Nyveria
3

Try something like this:

.*([\w-]+\.)+[a-z]{2,5}(/[\w-]+)*
Peter Mortensen
morja
1

You can try this:

r"(https?:\/\/)?([\w-]+\.)+([a-z]{2,5})(\/+\w+)? "

Selection:

  1. May start with http:// or https:// (optional)

  2. One or more words, each ending with a dot (.)

  3. Followed by 2 to 5 characters from [a-z]

  4. Followed by "/[anything]" (optional)

  5. Followed by a space

Peter Mortensen
1

Try this

$url_reg = '/(ftp|https?):\/\/(\w+:?\w*@)?(\S+)(:[0-9]+)?(\/([\w#!:.?+=&%@!\/-])?)?/';
Peter Mortensen
K6t
  • This expression worked on all except the ones that missed the http://www., such as example.com/khafenxj – Edmund Rojas Jun 21 '11 at 15:22
  • Is there a way to make the "www." part also optional? I know a little about regex, but I still find it complicated to read lol – Edmund Rojas Jun 21 '11 at 15:28
  • That shouldn't work on anything that misses http:// though, or anything else that misses the protocol. – phant0m Jun 21 '11 at 15:40
1

I have been using the following, which works for all my test cases and also avoids false matches at the end of a sentence that finishes with a full stop (end.) or on single-character initials, such as 'C.C. Plumbing'.

The following regex contains multiple {2,}s, which means two or more matches of the previous pattern.

((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]{2,}\.([a-zA-Z0-9\&\.\/\?\:@\-_=#]){2,}

Matches URLs such as, but not limited to:

Does not match non-URLs such as, but not limited to:

  • C.C Plumber
  • A full-stop at the end of a sentence.
  • Single characters such as a.b or x.y

Please note: Due to the above, this will not match any single character URLs, such as: a.co, but it will match if it is preceded by a URL scheme, such as: http://a.co.
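A quick way to check this behaviour is to wrap the pattern in a small function. This is a sketch only: the function name `looksLikeUrl` is hypothetical, and the `/` delimiters and single-quoted string are assumptions about how the pattern would be used in PHP.

```php
<?php
// Hypothetical helper: true when $text contains something the pattern treats as a URL.
function looksLikeUrl(string $text): bool
{
    // Same pattern as above, wrapped in "/" delimiters.
    // Both {2,} quantifiers demand at least two characters on each side of a dot.
    $pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]{2,}\.([a-zA-Z0-9\&\.\/\?\:@\-_=#]){2,}/';

    return preg_match($pattern, $text) === 1;
}

var_dump(looksLikeUrl('example.com/etcetc')); // matches
var_dump(looksLikeUrl('C.C. Plumbing'));      // initials, no match
var_dump(looksLikeUrl('a.co'));               // single-character host, no match
var_dump(looksLikeUrl('http://a.co'));        // matches once the scheme is present
```

The last case matches because the characters of `http://` are themselves in the first character class, which supplies the two or more characters required before the dot.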

Peter Mortensen
0

I had a lot of trouble getting anubhava's answer to work: PHP (at least in recent versions) interprets $ inside double-quoted strings, so the preg_match call wasn't working.

Here is what I used:

// Regular expression
$re = '/((https?|ftp):\/\/)?([a-z0-9+!*(),;?&=.-]+(:[a-z0-9+!*(),;?&=.-]+)?@)?([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))(:[0-9]{2,5})?(\/([a-z0-9+%-]\.?)+)*\/?(\?[a-z+&$_.-][a-z0-9;:@&%=+\/.-]*)?(#[a-z_.-][a-z0-9+$%_.-]*)?/i';
// Match all
preg_match_all($re, $blob, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
// The first element of the array is the full match
Peter Mortensen
Mederic
0

The PHP Composer package URL highlight does a good job of this:

<?php
    use VStelmakh\UrlHighlight\UrlHighlight;

    $urlHighlight = new UrlHighlight();
    $matches = $urlHighlight->getUrls($string);
?>
Peter Mortensen
İlter Kağan Öcal
-1

If it does not have to be regex, you could always use the validate filters that are in PHP.

filter_var('http://example.com', FILTER_VALIDATE_URL);

filter_var (mixed $variable [, int $filter = FILTER_DEFAULT [, mixed $options ]]);

Types of Filters

Validate Filters
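As a comment on this answer notes, FILTER_VALIDATE_URL expects a scheme, so www.example.com/etcetc and example.com/etcetc fail on their own. One possible workaround, sketched here under the assumption that a missing scheme may be treated as http:// for validation purposes only; the helper name `isUrlWithOptionalScheme` is hypothetical and not part of PHP:

```php
<?php
// Hypothetical helper: validates URLs whether or not they carry a scheme.
function isUrlWithOptionalScheme(string $candidate): bool
{
    // If no "scheme://" prefix is present, assume http:// purely for validation.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $candidate)) {
        $candidate = 'http://' . $candidate;
    }

    // filter_var returns the filtered string on success and false on failure.
    return filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}

var_dump(isUrlWithOptionalScheme('http://www.example.com/etcetc'));
var_dump(isUrlWithOptionalScheme('www.example.com/etcetc'));
var_dump(isUrlWithOptionalScheme('example.com/etcetc'));
```

This keeps the validation rules inside PHP's own filter and only uses a regular expression to detect whether a scheme is already there.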

Peter Mortensen
Mark Tomlin
  • This seems to expect the URL to have a protocol when I try it? – benedict_w Nov 15 '13 at 09:09
  • 2
    Validates value as URL (according to http://www.faqs.org/rfcs/rfc2396), optionally with required components. Beware a valid URL may not specify the HTTP protocol http:// so further validation may be required to determine the URL uses an expected protocol, e.g. ssh:// or mailto:. Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail. -- However, as this is built into PHP, you can expect it to be upgraded and updated later on to be made more useful. – Mark Tomlin Nov 15 '13 at 11:40
-1

Regex if you want to ensure a URL starts with HTTP/HTTPS:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)

If you do not require the HTTP protocol:

[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)
Peter Mortensen
gaya3_96