Matching specific regex words in url

Question

I must admit I've never gotten used to regex, but I recently ran into a problem where the workaround would have been more of a pain than using one. I need to match anything that follows this pattern at the beginning of a string: {any_url_safe_word} + ("/http://" || "/https://" || "/www.") + {any word}. So the following should match:

cars/http://google.com#test
cars/https://google.com#test
cars/www.google.com#test

The following shouldn't match:

cars/httdp://google.com#test
cars/http:/google.com#test

What I tried so far is: ^[\w]{1,500}\/[(http\:\/\/)|(https:\/\/])|([www\.])]{0,50}, but that matches cars/http from cars/httpd://google.com.

What is this: {any_url_safe_word}? – user4035 Nov 25 '13 at 15:41
E.g. cars, ca_rs, ca_1_rs, etc. Not "c a r s". – Babiker Nov 25 '13 at 15:43

score 3 · Answer 1 · edited May 23 '17 at 11:58

This regex could do:

^[\w\d]+\/(?:https?:\/\/)?(?:www\.)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}

And if you want to get everything that comes after it, you can just add (.*) to the end...
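As a quick check, here is a minimal PHP sketch that runs this pattern, with (.*) appended, against the samples from the question. The helper name matchAnswer1 and the extra sample cars/google.com are made up for illustration and are not part of the original post. Because both the scheme group and the www. group are optional, a bare cars/google.com is accepted as well, which may or may not be what you want.

```php
<?php
// Hypothetical helper: applies the pattern above with (.*) appended and
// returns whatever follows the matched part, or null when nothing matches.
function matchAnswer1(string $input): ?string
{
    $pattern = '/^[\w\d]+\/(?:https?:\/\/)?(?:www\.)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(.*)/';
    return preg_match($pattern, $input, $m) === 1 ? $m[1] : null;
}

$samples = [
    'cars/http://google.com#test',
    'cars/https://google.com#test',
    'cars/www.google.com#test',
    'cars/httdp://google.com#test',
    'cars/http:/google.com#test',
    'cars/google.com',              // illustrative extra: no scheme and no www.
];

foreach ($samples as $s) {
    $rest = matchAnswer1($s);
    echo $s, ' => ', $rest === null ? 'no match' : "match, rest = '$rest'", "\n";
}
```

If the scheme or www. prefix has to be mandatory, the two optional groups would need to be merged into a single required alternation.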

Live DEMO

And since the more or less general list of URL-safe characters seems to be ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;= (Source), you may restrict the first part to that list too, so you'll get (after simplification):

^[!#$&-.0-;=?-\[\]_a-z~]+\/(?:https?:\/\/)?(?:www\.)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}
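To see what the simplified character class actually covers, a small PHP sketch can test every character of the list against it. The variable names are only for illustration. When the ranges are expanded (&-. and 0-; and ?-[), the only listed character the class leaves out is /, which fits, because / is the separator that follows the word.

```php
<?php
// The characters quoted above as the URL-safe list.
$listed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&\'()*+,;=';

// The simplified character class from the regex, applied to one character at a time.
$class = '/^[!#$&-.0-;=?-\[\]_a-z~]$/';

$rejected = [];
foreach (str_split($listed) as $ch) {
    if (preg_match($class, $ch) !== 1) {
        $rejected[] = $ch;
    }
}

echo 'Rejected by the class: ', implode(' ', $rejected), "\n";
```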

answered Nov 25 '13 at 15:52

Enissay

Not quite that simple : a good regexp, just for domain name pattern matching, from http://hexillion.com/samples/ would be `^(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-](?!\.)){0,61}[a-zA-Z0-9]?\.)*[a-zA-Z0-9](?:[a-zA-Z0-9\-](?!$)){0,61}[a-zA-Z0-9]?$` – CD001 Nov 25 '13 at 15:58
True, matching a domain pattern is much more complex, I just picked the simplest version which fits his needs (I hope so) – Enissay Nov 25 '13 at 16:07
Heh - yeah, drop the `\w\d ... ` for your permitted list of characters in a `[ ... ]` and you should be good I think. – CD001 Nov 25 '13 at 16:19

user4035 · Answer 2 · 2013-11-25T16:36:13.463

<?php
$words = array(
    'cars/http://google.com#test',
    'cars/https://google.com#test',
    'cars/www.google.com#test',
    'cars/httdp://google.com#test',
    'cars/http:/google.com#test',
    'c a r s/http:/google.com#test'
    );

foreach($words as $value)
{
    /*
      ^             - start of the string
      \S+           - at least one non-space symbol
      \/            - slash
      (?: ... )     - group, so the prefix above applies to both alternatives
      https?:\/\/   - http with possible s then ://
      |             - or
      www\.         - www.
      .+            - at least one symbol
     */
    if (preg_match('/^\S+\/(?:https?:\/\/|www\.).+/', $value))
    {
        print $value. " good\n";
    }
    else
    {
        print $value. " bad\n";
    }
}

Prints:

cars/http://google.com#test good
cars/https://google.com#test good
cars/www.google.com#test good
cars/httdp://google.com#test bad
cars/http:/google.com#test bad
c a r s/http:/google.com#test bad
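One thing worth checking with patterns of this shape is how the | is grouped. In PCRE, alternation has the lowest precedence, so without an enclosing group everything to the left of | (including the ^) forms one branch and everything to the right forms the other. A minimal sketch, using two illustrative patterns and a sample input that is not in the list above:

```php
<?php
// Ungrouped: reads as "^\S+\/https?:\/\/"  OR  "www\." followed by anything,
// and the second branch is not tied to the start of the string.
$ungrouped = '/^\S+\/(https?:\/\/)|(www\.).+/';

// Grouped: the prefix ^\S+\/ has to precede either alternative.
$grouped = '/^\S+\/(?:https?:\/\/|www\.).+/';

// Illustrative input: spaces in the first word, followed by /www.
$input = 'c a r s/www.google.com#test';

var_dump(preg_match($ungrouped, $input)); // int(1): the www branch matches on its own
var_dump(preg_match($grouped, $input));   // int(0): \S+ cannot get past the spaces to reach the slash
```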

Phil Thomas · Answer 3 · 2013-11-25T16:17:32.790

0

Check out the demo.

[a-z0-9-_.~]+/(https?://|www\.)[a-z0-9]+\.[a-z]{2,6}([/?#a-z0-9-_.~])*

Edit: took @CD001's comment into account. Be sure to use the i modifier if you want case-insensitive matching.
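For what it's worth, here is roughly how that pattern could be used from PHP. The pattern contains unescaped slashes, so this sketch assumes % as the delimiter. The leading ^ and the function name looksLikePrefixedUrl are additions for illustration (the question asks for a match at the beginning of the string), and the i modifier is included so that uppercase input is accepted too.

```php
<?php
// Hypothetical wrapper around the pattern above.
// Assumptions: % as delimiter (the pattern has literal / characters),
// ^ added to anchor at the start, i modifier for case-insensitive matching.
function looksLikePrefixedUrl(string $input): bool
{
    $pattern = '%^[a-z0-9-_.~]+/(https?://|www\.)[a-z0-9]+\.[a-z]{2,6}([/?#a-z0-9-_.~])*%i';
    return preg_match($pattern, $input) === 1;
}

$samples = [
    'cars/http://google.com#test',
    'cars/www.google.com#test',
    'Cars/HTTPS://Google.com',       // illustrative: only accepted because of the i modifier
    'cars/httdp://google.com#test',
    'cars/http:/google.com#test',
];

foreach ($samples as $s) {
    echo $s, looksLikePrefixedUrl($s) ? " good\n" : " bad\n";
}
```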

edited Nov 25 '13 at 16:17

answered Nov 25 '13 at 16:06

The problem with using `\w` is that it matches any Perl "word" character and that changes depending on the locale in which PHP is running - technically you'll be matching characters like `Ö` which are **not** valid URL chars (yet). – CD001 Nov 25 '13 at 16:11
