RFC3986
RFC 3986, "Uniform Resource Identifier (URI): Generic Syntax", published by the Internet Engineering Task Force (IETF), is the authoritative standard that defines the precise syntax of every component of a valid generic URI. Appendix B presents the regex you need:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
With this regex the URI parts are stored as follows:
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
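As a rough illustration of that group numbering, here is a minimal JavaScript sketch that applies the unmodified Appendix B regex and reads groups 2, 4, 5, 7 and 9. The names RFC3986_APPENDIX_B and splitUri are made up for this example and do not come from the RFC.

```javascript
// Appendix B regex as published: no end anchor, every group capturing.
const RFC3986_APPENDIX_B =
  /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// splitUri is a hypothetical helper name used only for this sketch.
function splitUri(text) {
  // Every part is optional and there is no end anchor, so this pattern
  // matches (at least an empty prefix of) any string.
  const m = RFC3986_APPENDIX_B.exec(text);
  return {
    scheme: m[2],    // $2
    authority: m[4], // $4
    path: m[5],      // $5
    query: m[7],     // $7
    fragment: m[9],  // $9
  };
}

console.log(splitUri("http://www.site.com/part1/part2?key=value#blub"));
```

The odd-numbered groups 1, 3, 6 and 8 also capture the delimiters (":", "//", "?", "#"), which is why only the inner groups are read here.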
For the purpose of documenting the above regex, I've taken the liberty of rewriting it in free-spacing mode with comments and indentation, and present it here in the form of a tested PHP script which parses out all the major parts of a given URI string:
PHP Solution:
<?php // test.php Rev:20130830_0800
$re_rfc3986_parse_generic_uri = '%
    # Parse generic URI according to RFC3986 Appendix B.
    ^                 # Anchor to start of string.
    (?:               # Group for optional scheme.
      ([^:/?#]+)      # $1: Uri SCHEME.
      :               # Scheme ends with ":".
    )?                # Scheme is optional.
    (?:               # Group for optional authority.
      //              # Authority starts with "//"
      ([^/?#]*)       # $2: Uri AUTHORITY.
    )?                # Authority is optional.
    ([^?#]*)          # $3: Uri PATH (required).
    (?:               # Group for optional query.
      \?              # Query starts with "?".
      ([^#]*)         # $4: Uri QUERY.
    )?                # Query is optional.
    (?:               # Group for optional fragment.
      \#              # Fragment starts with "#".
      (.*)            # $5: Uri FRAGMENT.
    )?                # Fragment is optional.
    $                 # Anchor to end of string.
    %x';
$text = "http://www.site.com/part1/part2?key=value#blub";
if (preg_match($re_rfc3986_parse_generic_uri, $text, $matches)) {
    print_r($matches);
} else {
    echo("String is not a valid URI");
}
?>
Two functional changes were made to the original regex: 1.) the unnecessary capture groups were converted to non-capturing groups, and 2.) a $ end-of-string anchor was added at the end of the expression. Note that an even more readable version could be created by using named capturing groups rather than numbered ones, but that version would not transfer directly to JavaScript engines that lack named-group support (added in ES2018).
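For engines that do implement ES2018 named capturing groups, a named-group variant might look like the sketch below. This is an illustration only, not part of the tested scripts in this answer; the constant name RFC3986_NAMED and the group names are my own choices.

```javascript
// Same pattern as the modified regex, with (?<name>...) groups.
// Assumes an ES2018-capable engine; older engines throw a SyntaxError.
const RFC3986_NAMED =
  /^(?:(?<scheme>[^:\/?#]+):)?(?:\/\/(?<authority>[^\/?#]*))?(?<path>[^?#]*)(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?$/;

const match = RFC3986_NAMED.exec("http://www.site.com/part1/part2?key=value#blub");

// match.groups holds one property per named group; copy it to a plain object.
const parts = match ? { ...match.groups } : undefined;
console.log(parts);
```

Groups that did not participate in the match are still present on the groups object, with the value undefined.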
PHP Script Output:
Array
(
[0] => http://www.site.com/part1/part2?key=value#blub
[1] => http
[2] => www.site.com
[3] => /part1/part2
[4] => key=value
[5] => blub
)
JavaScript Solution:
Here is a tested JavaScript function which decomposes a valid URI into its various components:
// Parse a valid URI into its various parts per RFC3986.
function parseValidURI(text) {
    var uri_parts;
    var re_rfc3986_parse_generic_uri =
        /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;
    // Use String.replace() with callback function to parse the URI.
    text.replace(re_rfc3986_parse_generic_uri,
        function(m0,m1,m2,m3,m4,m5) {
            uri_parts = {
                scheme    : m1,
                authority : m2,
                path      : m3,
                query     : m4,
                fragment  : m5
            };
            return; // return value is not used.
        });
    return uri_parts;
}
Note that the non-path properties of the returned object may be undefined if they are not present in the URI string. Also, if the URI string does not match this regex (i.e. it is obviously invalid), the function returns undefined.
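To make the two undefined cases concrete, here is a small sketch that uses RegExp.prototype.exec() instead of the String.replace() callback. The names RE_URI and parseWithExec are hypothetical, chosen for this example only, and the sample inputs are my own.

```javascript
const RE_URI =
  /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;

// parseWithExec is a hypothetical variant of the function above.
function parseWithExec(text) {
  const m = RE_URI.exec(text);
  if (m === null) return undefined; // no match: whole result is undefined
  const [, scheme, authority, path, query, fragment] = m;
  return { scheme, authority, path, query, fragment };
}

// No "//", "?" or "#": authority, query and fragment are undefined.
console.log(parseWithExec("mailto:user@example.com"));

// A newline after "#" defeats (.*)$ because "." does not match a newline
// without the s flag, so the whole match fails.
console.log(parseWithExec("/a/b#frag\nmore"));
```

A line break inside the fragment is about the only way to make this particular pattern fail, since every other component accepts almost any character.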
Notes:
- The only component of a generic URI that is required is the path (which itself may be empty).
- An empty string is a valid URI!
- The above regex does not validate a URI, but rather parses a given valid URI.
- If the above regex fails to match a URI string, then that string is not a valid URI. However, the converse is not true - if the string does match the above regex, it does not mean that the URI is valid, but just means that it is parsable as a URI.
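The notes above can be checked with a short sketch. The sample strings are my own, and tryParse is a hypothetical helper name: the empty string matches with an empty path, and a string that clearly violates RFC 3986 (a scheme starting with a digit, a space in the authority, "<" and ">" in the path) still parses.

```javascript
const RE_GENERIC_URI =
  /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$/;

// Hypothetical helper: returns [scheme, authority, path, query, fragment]
// or null when the pattern does not match.
function tryParse(text) {
  const m = RE_GENERIC_URI.exec(text);
  return m ? m.slice(1) : null;
}

// Empty string: only the (empty) path is defined.
console.log(tryParse(""));

// Parsable, yet not a valid URI under the RFC 3986 grammar.
console.log(tryParse("1http://exa mple.com/<bad>"));
```

Matching therefore only shows that the string can be split into components; checking each component against the Appendix A grammar is a separate step.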
For those interested in validating a URI and breaking it down further, I've written an article which takes all the parts defined in RFC3986 Appendix A and converts them to regex syntax. See:
Happy regexing!