The essence of your problem with specifying the regex is a difference of one byte: q versus qr. You're writing a regex, so call it what it is. Treating the pattern as a string means you have to deal with the rules for string quoting on top of the rules for regex escaping.
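As a rough illustration of that one-byte difference, here is a minimal sketch. The variable names are made up for the demonstration, and the character class is borrowed from the script-name part of the pattern; your original code may have tripped over string quoting in a different spot.

```perl
use strict;
use warnings;

# q{} builds a string: \\ collapses to a single backslash before the
# regex engine ever sees the pattern.
my $as_string = q{[^\s\\]+};

# qr{} builds a regex: the pattern reaches the engine as written.
my $as_regex = qr{[^\s\\]+};

print "string holds: $as_string\n";

# Compiling the string as a regex dies at run time, because the lone
# backslash now escapes the closing bracket; eval traps the failure.
my $compiled = eval { qr/$as_string/ };
print "string as regex: ",
    (defined $compiled ? "compiles" : "fails to compile"), "\n";

# The qr{} version matches a run of non-whitespace, non-backslash characters.
my ($word) = 'script.bi \\' =~ /($as_regex)/;
print "qr match: $word\n";
```

With q, the string holds only one backslash inside the brackets, so the character class never closes. With qr, the same text compiles and matches script.bi.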
As for the language that your regex matches, add anchors to force the pattern to match the entire line. The regex engine is fiercely determined and will keep working until it finds a match. Without anchors, it's happy to find a substring.
Sometimes this gives you surprising results. Have you ever dealt with a petulant child (or a childish adult) who takes a narrow, exceedingly literal interpretation of what you say? The regex engine is that way, but it's trying to help.
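To see the effect of anchors in isolation, here is a small sketch with two deliberately simplified patterns. They are stand-ins for this demonstration, not the pattern from your question, and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $loose  = qr{run +([^\s\\]+)};        # no anchors: any substring will do
my $strict = qr{^\s*run +([^\s\\]+)$};   # anchored: the whole line must fit

my $line = 'run script.bi \\';           # ends with a continuation backslash

my ($loose_hit)  = $line =~ $loose;
my ($strict_hit) = $line =~ $strict;

print "loose:  ", (defined $loose_hit  ? $loose_hit  : '(no match)'), "\n";
print "strict: ", (defined $strict_hit ? $strict_hit : '(no match)'), "\n";
```

The unanchored pattern is satisfied by the run script.bi substring and ignores the trailing backslash. The anchored one has to account for every character up to the end of the line, so it rejects the input.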
With the last example, it matches because
- You said with the ? quantifier that the cred=... subpattern can match zero times, so the regex engine skipped it.
- You said the script name is the following substring that's a run of one or more non-whitespace, non-backslash characters, so the regex engine saw cred=username/password, none of which are whitespace or backslash characters, and matched. Regexes are greedy: they consider what's right in front of them without regard to whether a given substring “should have” been matched by another subpattern.
The last example fits the bill, although not in the way that you intended. An important lesson with regexes is that any quantifier such as ? or * that can match zero times always succeeds!
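A quick way to convince yourself is to run quantified patterns against text that contains nothing they are looking for. This is only a sketch; the subject string and the %matched hash are invented for the demonstration.

```perl
use strict;
use warnings;

my $subject = 'abc';    # contains no "x" at all
my %matched;

for my $quantified ('x?', 'x*', 'x+') {
    # ? and * are allowed to match zero characters; + needs at least one.
    $matched{$quantified} = $subject =~ /$quantified/ ? 1 : 0;
    printf "%-3s %s\n", $quantified,
        $matched{$quantified} ? 'matches' : 'does not match';
}
```

Both x? and x* report a match by matching the empty string at the very start of abc. Only x+ fails, because it insists on at least one x.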
Without the $ anchor, the pattern from your question leaves the trailing backslash unmatched, which you can see with a slight modification to $runpat.
my $runpat = qr{run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)(.*)}; # ' SO hiliter hack
Notice the (.*) at the end to grab any non-newline characters that may be left. Changing the loop to
while (<DATA>) {
    next unless /$runpat/;
    print "line $.: \$1=[$1]; \$2=[$2]\n";
}
gives the following output for line 15.
line 15: $1=[cred=username/password]; $2=[ \]
As a complete program, that becomes
#! /usr/bin/env perl
use strict;
use warnings;
# The goofy comment on the next line is a hack to
# help Stack Overflow's syntax highlighter recover
# from its confusion after seeing the quotes. It's
# for presentation only: you won't need it in your
# real code.
my $runpat = qr{^\s*run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)$}; # '
while (<DATA>) {
    next unless /$runpat/;
    print "line $.: \$1=[$1]\n";
}
__DATA__
# normal way
run cred=username/password script.bi

# single quoted username password, also separated in a different way
run cred='username password' script.bi

# username/password is optional
run script.bi

# script extension is optional
run script

# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi
Output:
line 2: $1=[script.bi]
line 5: $1=[script.bi]
line 8: $1=[script.bi]
line 11: $1=[script]
Conciseness isn't always helpful with regexes. Consider the following alternative but equivalent specification:
my $runpat = qr{
    ^ \s*
    (?:
        run \s+ cred=(?:[^\s']*|'.*?') \s+ (?<script> [^\s\\]+) # ' hiliter
      | run \s+ (?!cred=) (?<script> [^\s\\]+)
    )
    \s* $
}x;
Yes, it takes more room to write, but it's clearer about acceptable alternatives. Your loop is nearly the same
while (<DATA>) {
    next unless /$runpat/;
    print "line $.: script=[$+{script}]\n";
}
and even spares the poor reader from having to count parentheses.
To use named capture buffers, e.g., (?<script>...), be sure to add
use 5.10.0;
to the top of your program to provide executable documentation of the minimum required version of perl.
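If you want to try the named-capture version outside the DATA loop, a sketch along these lines may help. The script_name helper and the sample lines are hypothetical additions for illustration; the pattern is the /x one from above with the highlighter comment removed.

```perl
use 5.010;
use strict;
use warnings;

my $runpat = qr{
    ^ \s*
    (?:
        run \s+ cred=(?:[^\s']*|'.*?') \s+ (?<script> [^\s\\]+)
      | run \s+ (?!cred=) (?<script> [^\s\\]+)
    )
    \s* $
}x;

# Hypothetical helper: returns the script name, or undef when the
# line does not match. Both alternatives use the same capture name,
# so $+{script} holds whichever one took part in the match.
sub script_name {
    my ($line) = @_;
    return $line =~ $runpat ? $+{script} : undef;
}

for my $line (
    "run cred=username/password script.bi",
    "run cred='username password' script.bi",
    "run script",
    "run cred=username/password \\",
) {
    my $name = script_name($line);
    printf "%-42s => %s\n", $line, defined $name ? $name : '(no match)';
}
```

The first three lines should yield script.bi, script.bi, and script, while the continuation line is rejected. Reusing one name in both alternatives means the caller never has to know which branch matched.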