Appendix B of RFC 2396 gives a regex for parsing URIs.
B. Parsing a URI Reference with a Regular Expression
As described in Section 4.3, the generic URI syntax is not sufficient to disambiguate the components of some forms of URI. Since the “greedy algorithm” described in that section is identical to the disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential four components and fragment identifier of a URI reference.
The following line is the regular expression for breaking-down a URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression n as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
where <undefined> indicates that the component is not present, as is the case for the query component in the above example. Therefore, we can determine the value of the four components and fragment as
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
and, going in the opposite direction, we can recreate a URI reference from its components using the algorithm in step 7 of Section 5.2.
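As a rough illustration of that round trip, the sketch below splits a reference with the Appendix B regex and puts it back together, appending each delimiter only when its component is defined, in the spirit of the recombination step the RFC refers to. The helper names split_uri and join_uri are made up for this example; they are not part of the RFC or of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a URI reference into its five components
# using the Appendix B regex. Absent components come back as undef.
sub split_uri {
    my ($uri) = @_;
    $uri =~ m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
        or return;
    return (scheme => $2, authority => $4, path => $5,
            query  => $7, fragment  => $9);
}

# Hypothetical helper: recombine the components, adding each delimiter
# only when the corresponding component is defined.
sub join_uri {
    my (%c) = @_;
    my $result = '';
    $result .= "$c{scheme}:"     if defined $c{scheme};
    $result .= "//$c{authority}" if defined $c{authority};
    $result .= $c{path};
    $result .= "?$c{query}"      if defined $c{query};
    $result .= "#$c{fragment}"   if defined $c{fragment};
    return $result;
}

my $uri   = 'http://www.ics.uci.edu/pub/ietf/uri/#Related';
my %parts = split_uri($uri);
print join_uri(%parts) eq $uri ? "round trip ok\n" : "mismatch\n";
```

Testing with defined rather than truthiness matters here: an empty but present query (as in "foo?") should still get its "?" back, while an absent one should not.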
The regex is directly usable in Perl, as in
if ($uri =~ m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!) {
    # $4 is the authority (the host here), $5 is the path
    my($host,$path) = ($4,$5);
    print "$host => $path\n";
}
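The greedy quantifiers will likely make this pattern challenging to use with s/// on free text. The pattern is anchored with ^, and its character classes exclude only the URI delimiters (: / ? #), not whitespace, so it consumes as much text as possible and overruns URI boundaries that are not otherwise marked.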
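A small demonstration of the overrun, using a made-up line of text: the words around the URI are swept into the scheme and the path because nothing in the pattern stops at whitespace.

```perl
use strict;
use warnings;

# Made-up sample text with a URI embedded in a sentence.
my $line = 'see http://example.com/a for details';

my ($scheme, $authority, $path);
if ($line =~ m!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!) {
    ($scheme, $authority, $path) = ($2, $4, $5);
    print "scheme    = '$scheme'\n";     # 'see http'
    print "authority = '$authority'\n";  # 'example.com'
    print "path      = '$path'\n";       # '/a for details'
}
```

The regex is meant for splitting a string that is already known to be a URI reference, not for locating one inside surrounding prose.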
More directly applicable is the URI::Find module, available on CPAN. Wrapping each URI it finds in LEFT and RIGHT is as simple as
#! /usr/bin/env perl
use strict;
use warnings;
use URI::Find;
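# The callback receives the URI object and the original text;
# its return value replaces the matched text.
my $finder = URI::Find->new(sub {
    my(undef,$found) = @_;
    "LEFT $found RIGHT";
});
# find() rewrites the string in place through the reference.
while (<>) {
    $finder->find(\$_);
    print;
}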
Output:
$ cat input
This is a plain text input suitable for
an answer to a question on http://stackoverflow.com
In particular, the question is available at
http://stackoverflow.com/q/15233535/123109 and the answer
at http://stackoverflow.com/a/15234378/123109
$ ./mark-uris input
This is a plain text input suitable for
an answer to a question on LEFT http://stackoverflow.com RIGHT
In particular, the question is available at
LEFT http://stackoverflow.com/q/15233535/123109 RIGHT and the answer
at LEFT http://stackoverflow.com/a/15234378/123109 RIGHT