
I'm trying to extract the url authority (without protocol and www. if present) and everything after it (if present). My regex so far:

/^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)/;

This works on a URL which has everything, like:

http://www.site.com/part1/part2?key=value#blub

But if I mark the path-capturing group as optional:

/^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)?/

It doesn't match anymore. Why?

Now if I keep the first variant and match it against:

http://site.com

it extracts : as the first value (authority) and //site.com as the second (path).

I didn't expect this to match at all, since the URL has no path and the path group is not marked as optional. But I still wonder about this result, since I have only these 2 capturing groups: (.*?)(\/.*)

http://jsfiddle.net/U2tKT/1/

Can someone explain to me what's wrong? Please no links to complete URL parsing solutions. I know there are plenty of those, but I want to understand what's wrong with my regex (and how to fix it).

Thanks.

User

3 Answers


user1436026 posted JUST before I was about to hit the submit button, but here goes:

Your domain (authority) pattern is marked as being "ungreedy", which matches as little as possible. And in your case, it actually satisfies the pattern to match nothing at all - which is about as little as it gets. What you want instead is to have the domain match as much as possible, until you are positive that what it is matching is no longer a domain (I changed the regex to match anything but /, and as much as it can find.)

/^(?:http|https)?(?::\/\/)?(?:www\.)?([^\/]+)(\/.*)?/
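A quick sketch to check this pattern against a few inputs. The `split` helper, the `reGrouped` variant and the `https` sample URL are illustrative assumptions, not part of the original answer. One caveat the sketch exposes: because `http` comes first in the alternation and every later piece is optional, an `https://` URL matches `http` and then captures `s:` as the authority. Grouping the scheme and `://` into one optional unit appears to avoid that.

```javascript
// Pattern from the answer above.
const re = /^(?:http|https)?(?::\/\/)?(?:www\.)?([^\/]+)(\/.*)?/;
// Hypothetical variant: scheme and "://" are optional together, not separately.
const reGrouped = /^(?:https?:\/\/)?(?:www\.)?([^\/]+)(\/.*)?/;

// Illustrative helper: returns [authority, path], or null if there is no match.
function split(pattern, url) {
    const m = pattern.exec(url);
    return m ? [m[1], m[2]] : null;
}

console.log(split(re, "http://www.site.com/part1/part2?key=value#blub"));
// authority "site.com", path "/part1/part2?key=value#blub"
console.log(split(re, "http://site.com"));
// authority "site.com", path undefined
console.log(split(re, "https://site.com/a"));
// authority "s:", path "//site.com/a"  (the caveat described above)
console.log(split(reGrouped, "https://site.com/a"));
// authority "site.com", path "/a"
```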

I know you specifically state you don't want any links to any URL parsing solutions in JS, but did you know JS has it built in already? :)

var link = document.createElement('a');
link.href = "http://www.site.com/part1/part2?key=value#blub";
var auth = link.hostname; // www.site.com
var path = link.pathname; // /part1/part2
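The anchor-element trick needs a DOM. As a rough alternative, here is a sketch using the standard `URL` constructor, assuming a runtime that provides it (modern browsers and Node.js do). The variable names and the `www.` stripping step are illustrative additions, chosen to mirror what the question asks for.

```javascript
// Assumes a runtime with the WHATWG URL API (modern browsers, Node.js).
const link = new URL("http://www.site.com/part1/part2?key=value#blub");

// Strip a leading "www." to get the bare host the question asks for.
const auth = link.hostname.replace(/^www\./, "");
// Everything after the authority: path, query string and fragment.
const rest = link.pathname + link.search + link.hash;

console.log(auth); // site.com
console.log(rest); // /part1/part2?key=value#blub
```

Note that `hostname` excludes any port; `link.host` would include it if the full authority is wanted.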
Sean Johnson
  • Ahh... negate operator for /, makes sense. I suspected my problem was related with the greedy thing but didn't know exactly what it was :) thx a lot. – User Aug 30 '13 at 13:28
  • I read now your edit - no, I didn't know. Maybe I change my regex with that. Thanks again (can't vote more)! – User Aug 30 '13 at 13:34

At the end of your regex /^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)?/, the (.*?) group is lazy (because of the ? modifier), so it tries to match as little as possible while still satisfying the regex. Because you have made the last part of your regex optional, (\/.*)? is allowed to match nothing, so the (.*?) does not have to match anything either. Whereas, when the last part of your regex was mandatory, (\/.*), the (.*?) was forced to match enough for the (\/.*) to match.
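The behaviour described above can be reproduced with a short sketch. The variable names are illustrative; the patterns and URLs are the ones from the question. It also shows that the optional variant does still match, just with an empty first group, and why the mandatory variant yields `:` and `//site.com` for a URL without a path.

```javascript
const url = "http://www.site.com/part1/part2?key=value#blub";

// Path group mandatory: the lazy (.*?) must expand until (\/.*) can match.
const required = /^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)/.exec(url);
console.log(required[1]); // site.com
console.log(required[2]); // /part1/part2?key=value#blub

// Path group optional: the lazy (.*?) stops at zero characters,
// and (\/.*)? is skipped because the next character is "s", not "/".
const optional = /^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)?/.exec(url);
console.log(JSON.stringify(optional[0])); // "http://www."
console.log(JSON.stringify(optional[1])); // ""
console.log(optional[2]);                 // undefined

// Mandatory path but no path in the input: the engine backtracks, gives up
// the optional "://", and lets (\/.*) start at the first "/" it can find.
const noPath = /^(?:http|https)?(?::\/\/)?(?:www\.)?(.*?)(\/.*)/.exec("http://site.com");
console.log(noPath[1]); // :
console.log(noPath[2]); // //site.com
```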

DJG
  • Thanks for the explanation (+1) but the other answer also contains a solution, so I'll select that :) – User Aug 30 '13 at 13:29

RFC3986

The Internet Engineering Task Force's (IETF) Request for Comments (RFC) document number 3986 titled: "Uniform Resource Identifier (URI): Generic Syntax" (RFC3986), is the authoritative standard which describes the precise syntax of all components that make up a valid generic Uniform Resource Identifier (URI). Appendix B presents the regex you need:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

With this regex the URI parts are stored as follows:

scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
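As a minimal sketch of that group mapping, here is the Appendix B regex applied unchanged in JavaScript (only the `/` characters are escaped so it can be written as a regex literal). The `parts` object and its property names are illustrative.

```javascript
// RFC 3986 Appendix B regex, with "/" escaped for a JavaScript regex literal.
const re = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

const m = re.exec("http://www.site.com/part1/part2?key=value#blub");

// Pick out the groups listed above: $2, $4, $5, $7 and $9.
const parts = {
    scheme:    m[2], // http
    authority: m[4], // www.site.com
    path:      m[5], // /part1/part2
    query:     m[7], // key=value
    fragment:  m[9]  // blub
};
console.log(parts);
```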

For the purpose of documenting the above regex, I've taken the liberty of rewriting it in free-spacing mode with comments and indentation, and present it here in the form of a tested PHP script which parses out all the major parts of a given URI string:

PHP Solution:

<?php // test.php Rev:20130830_0800

$re_rfc3986_parse_generic_uri = '%
    # Parse generic URI according to RFC3986 Appendix B.
    ^             # Anchor to start of string.
    (?:           # Group for optional scheme.
      ([^:/?#]+)  # $1: Uri SCHEME.
      :           # Scheme ends with ":".
    )?            # Scheme is optional.
    (?:           # Group for optional authority.
      //          # Authority starts with "//"
      ([^/?#]*)   # $2: Uri AUTHORITY.
    )?            # Authority is optional.
    ([^?#]*)      # $3: Uri PATH (required).
    (?:           # Group for optional query.
      \?          # Query starts with "?".
      ([^#]*)     # $4: Uri QUERY.
    )?            # Query is optional.
    (?:           # Group for optional fragment.
      \#          # Fragment starts with "#".
      (.*)        # $5: Uri FRAGMENT.
    )?            # Fragment is optional.
    $             # Anchor to end of string.
    %x';

$text = "http://www.site.com/part1/part2?key=value#blub";

if (preg_match($re_rfc3986_parse_generic_uri, $text, $matches)) {
    print_r($matches);
} else {
    echo("String is not a valid URI");
}
?>

Two functional changes were made to the original regex: 1.) the unnecessary capture groups were converted to be non-capturing, and 2.) an $ end of string anchor was added at the end of the expression. Note that an even more readable version could be created by using named capturing groups rather than using numbered capturing groups, but that one would not transfer directly over to JavaScript syntax.

PHP Script Output:

Array
(
    [0] => http://www.site.com/part1/part2?key=value#blub
    [1] => http
    [2] => www.site.com
    [3] => /part1/part2
    [4] => key=value
    [5] => blub
)

JavaScript Solution:

Here is a tested JavaScript function which decomposes a valid URI into its various components:

// Parse a valid URI into its various parts per RFC3986.
function parseValidURI(text) {
    var uri_parts;
    var re_rfc3986_parse_generic_uri =
    /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;
    // Use String.replace() with callback function to parse the URI.
    text.replace(re_rfc3986_parse_generic_uri,
        function(m0,m1,m2,m3,m4,m5) {
            uri_parts = {
                scheme      : m1,
                authority   : m2,
                path        : m3,
                query       : m4,
                fragment    : m5
            };
            return; // return value is not used.
        });
    return uri_parts;
}

Note that the non-path properties of the returned object may be undefined if not present in the URI string. Also, if the URI string does not match this regex, (i.e. it is obviously invalid), the returned value is undefined.

Notes:

  • The only component of a generic URI that is required is the path (which itself may be empty).
  • An empty string is a valid URI!
  • The above regex does not validate a URI, but rather parses a given valid URI.
  • If the above regex fails to match a URI string, then that string is not a valid URI. However, the converse is not true - if the string does match the above regex, it does not mean that the URI is valid, but just means that it is parsable as a URI.
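A small sketch of the "parses, but does not validate" point, using the same regex as the JavaScript function above. The sample strings are made up for illustration.

```javascript
const re = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;

// The empty string matches: every component except the path is optional,
// and the path itself may be empty.
console.log(re.test("")); // true

// Text that is clearly not a useful URI still parses; it all lands in the path.
const junk = re.exec("not a uri at all <>");
console.log(junk[3]); // not a uri at all <>

// One of the few ways to fail: a line break inside the fragment, because
// (.*) does not cross newlines and $ must sit at the very end of the input.
console.log(re.test("a#b\nc")); // false
```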

For those interested in validating a URI and breaking it down further, I've written an article which takes all the parts defined in RFC3986 Appendix A and converts them to regex syntax. See:

Regular Expression URI Validation

Happy regexing!

ridgerunner
  • Love the Regex, the very intensive research, and the very well written post. Just want to point you to [parse_url](http://php.net/manual/en/function.parse-url.php) for the PHP solution. :) – Sean Johnson Aug 30 '13 at 23:20
  • @Sean Johnson - Thanks. The only reason for the PHP version above was to provide a vehicle for the commented, free-spacing mode version of the RFC3986 regex. I agree that if a native library function exists, it's best to use that and not try to re-invent the wheel. – ridgerunner Aug 31 '13 at 02:11
  • Nice (+1) although not exactly what I asked - it captures "www" and doesn't force the protocol to be http/https. But I guess it's cleaner to do these checks/filtering on the matched parts, instead of putting it in the regex. This regex has other problem, though, at least for my use case, it will match strings like "www.bla.com" or "bla.com" as path and leave authority undefined. To fix that I changed it to `^(?:([^:\/?#]+):)?(?:\/\/)?([^\/?#]*)([^?#]*)(?:\?([^#]*))?(?:#(.*))?$` – User Sep 04 '13 at 10:03
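To compare the two patterns discussed in the last comment, here is a rough sketch. The `authorityAndPath` helper and the sample inputs are assumptions for illustration. It suggests the modified regex does pick up a scheme-less host as the authority, at the cost of also treating the part after the colon in something like `mailto:user@example.com` as an authority.

```javascript
// Regex from the JavaScript solution above.
const rfc = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;
// Modified regex from the comment: "//" is optional and the authority is always captured.
const modified = /^(?:([^:\/?#]+):)?(?:\/\/)?([^\/?#]*)([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;

// Illustrative helper: returns [authority, path] for the given pattern.
function authorityAndPath(re, text) {
    const m = re.exec(text);
    return [m[2], m[3]];
}

console.log(authorityAndPath(rfc, "bla.com/x"));
// authority undefined, path "bla.com/x"
console.log(authorityAndPath(modified, "bla.com/x"));
// authority "bla.com", path "/x"
console.log(authorityAndPath(modified, "/part1/part2"));
// authority "", path "/part1/part2"
console.log(authorityAndPath(modified, "mailto:user@example.com"));
// authority "user@example.com", path ""
```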