I am trying to write a regex that will parse the syntax for calling a script and capture the script name.

All of these are valid syntax for the call:

# normal way
run cred=username/password script.bi

# single quoted username password, also separated in a different way
run cred='username password' script.bi

# username/password is optional
run script.bi

# script extension is optional
run script

# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi

This is what I have so far:

my $r = q{run +(?:cred=(?:[^\s\']*|\'.*\') +)?([^\s\\]+)};

to capture the script name in $1.

But I get this error:

Unmatched [ before HERE mark in regex m/run +(?:cred=(?:[^\s\']*|\'.*\') +)?([ << HERE ^\s\]+)/
Brad Mace
Lazer

3 Answers

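Inside q{}, the \\ collapses to a single \, so the string that reaches the regex engine ends in [^\s\]+. There the \] is an escaped ], which leaves the [ without a closing bracket, hence the "Unmatched [" error.

Replace it with run +(?:cred=(?:[^\s\']*|\'.*\') +)?([^\s\\\\]+) (note the \\\\) and try again.

Also, as the comments point out, you should be using qr for a regex rather than plain q.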

( I had just looked at the error, not the validity / efficiency of the regex for your problem)
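
To make the difference concrete, here is a minimal sketch (the variable names are mine, not from the question) that puts only the script-name part of the pattern into q{} and into qr{} and then tries to use each one:

```perl
use strict;
use warnings;

# q{} is a single-quoted string: \\ collapses to one backslash.
my $as_string = q{([^\s\\]+)};
# qr{} is a regex literal: \\ stays an escaped backslash for the regex engine.
my $as_regex  = qr{([^\s\\]+)};

print "q{} holds: $as_string\n";

# Compiling the string at run time fails, because \] now escapes the bracket.
my $compiled = eval { qr/$as_string/ };
print "string compiles as a regex: ", (defined $compiled ? "yes" : "no"), "\n";

# The qr{} version matches up to, but not including, the backslash.
my $input = 'script.bi\\';
my ($name) = $input =~ $as_regex;
print "qr{} captures: $name\n";
```

Printing the q{} string shows one backslash fewer than was typed, which is exactly what the error message in the question displays.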

manojlds
  • @Lazer Your question is not clear. Formatting is messed up. Basically, the escapes are needed for perl strings first and then the escaped string is treated as regex. So then the escapes for regex apply. I am putting it crudely, but I hope that explains it. – manojlds May 25 '11 at 17:02
  • @manojlds: But I am putting my regex in `q{}`, so we should not need double escaping. Isn't that right? – Lazer May 25 '11 at 17:09
  • @Lazer: More specifically, you have not provided us the context to figure out what problem is with the back-slashes. If you are, for instance, storing this RE in a string and then evaluating the string in the RE, that would cause the problem. As you can see in my example above, it works just fine when directly interpolated. – Seth Robertson May 25 '11 at 17:10
  • @Lazer: Print out the RE after you assign it. You will note that `` \' -> ' `` and `` \\ -> \ ``. Single quotes still perform some interpolation inside them, in perl. – Seth Robertson May 25 '11 at 17:14
  • @Lazer - agree with @Seth Robertson. Depends on how you are using it. The error suggests you are using it differently. – manojlds May 25 '11 at 17:14
  • If you want to store a string as a regex, you should use the `qr` operator. Then the backslashes will behave as expected. – Platinum Azure May 25 '11 at 19:27
The essence of your problem with specifying the regex is a difference of one byte: q versus qr. You're writing a regex, so call it what it is. Treating the pattern as a string means you have to deal with the rules for string quoting on top of the rules for regex escaping.

As for the language that your regex matches, add anchors to force the pattern to match the entire line. The regex engine is fiercely determined and will keep working until it finds a match. Without anchors, it's happy to find a substring.

Sometimes this gives you surprising results. Have you ever dealt with a petulant child (or a childish adult) who takes a narrow, exceedingly literal interpretation of what you say? The regex engine is that way, but it's trying to help.

With the last example it matches because

  • You said with the ? quantifier that the cred=... subpattern can match zero times, so the regex engine skipped it.
  • You said the script name is the following substring that's a run of one or more non-whitespace, non-backslash characters, so the regex engine saw cred=username/password, none of which are whitespace or backslash characters, and matched. Regexes are greedy: they consider what's right in front of them without regard to whether a given substring “should have” been matched by another subpattern.

The last example fits the bill—although not in the way that you intended. An important lesson with regexes is any quantifier such as ? or * that can match zero times always succeeds!
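
A small sketch of that behaviour, assuming a capturing group is placed around the optional cred=... part so that its participation can be inspected (the question's pattern uses a non-capturing group there, and the variable names are illustrative):

```perl
use strict;
use warnings;

# The continuation line from the question, ending in a single backslash.
my $line = 'run cred=username/password \\';

# Same shape as the question's pattern, but the cred group now captures.
my $unanchored = qr{run +(cred=(?:[^\s']*|'.*') +)?([^\s\\]+)};
my $anchored   = qr{^\s*run +(cred=(?:[^\s']*|'.*') +)?([^\s\\]+)$};

my ($cred_unanchored, $script_unanchored);
if ($line =~ $unanchored) {
    ($cred_unanchored, $script_unanchored) = ($1, $2);
    printf "unanchored: cred group %s, script=[%s]\n",
        (defined $cred_unanchored ? "matched" : "skipped"), $script_unanchored;
}

# With ^ and $ the engine has no substring to fall back on.
my $anchored_matches = ($line =~ $anchored) ? 1 : 0;
print "anchored: ", ($anchored_matches ? "match" : "no match"), "\n";
```

The optional group is left undefined in the unanchored case, and the credentials end up in the script-name capture instead.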

Without the $ anchor, the pattern from your question leaves the trailing backslash unmatched, which you can see with a slight modification to $runpat.

my $runpat = qr{run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)(.*)}; # ' SO hiliter hack

Notice the (.*) at the end to grab any non-newline characters that may be left. Changing the loop to

while (<DATA>) {
  next unless /$runpat/;
  print "line $.: \$1=[$1]; \$2=[$2]\n";
}

gives the following output for line 15.

line 15: $1=[cred=username/password]; $2=[ \]

As a complete program, that becomes

#! /usr/bin/env perl

use strict;
use warnings;

# The goofy comment on the next line is a hack to
# help Stack Overflow's syntax highlighter recover
# from its confusion after seeing the quotes. It's
# for presentation only: you won't need it in your
# real code.
my $runpat = qr{^\s*run +(?:cred=(?:[^\s']*|\'.*\') +)?([^\s\\]+)$}; # '

while (<DATA>) {
  next unless /$runpat/;
  print "line $.: \$1=[$1]\n";
}

__DATA__
# normal way
run cred=username/password script.bi

# single quoted username password, also separated in a different way
run cred='username password' script.bi

# username/password is optional
run script.bi

# script extension is optional
run script

# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \
script.bi

Output:

line 2: $1=[script.bi]
line 5: $1=[script.bi]
line 8: $1=[script.bi]
line 11: $1=[script]

Conciseness isn't always helpful with regexes. Consider the following alternative, nearly equivalent specification (it also tolerates trailing whitespace and any whitespace between the parts):

my $runpat = qr{
  ^ \s*
  (?:
    run \s+ cred=(?:[^\s']*|'.*?') \s+ (?<script> [^\s\\]+)  # ' hiliter
  | run \s+ (?!cred=)                  (?<script> [^\s\\]+)
  )
  \s* $
}x;

Yes, it takes more room to write, but it's clearer about acceptable alternatives. Your loop is nearly the same

while (<DATA>) {
  next unless /$runpat/;
  print "line $.: script=[$+{script}]\n";
}

and even spares the poor reader from having to count parentheses.

To use named capture buffers, e.g., (?<script>...), be sure to add

use 5.10.0;

to the top of your program to provide executable documentation of the minimum required version of perl.

Greg Bacon
  • For the last example, how come it matches the regex at all? We already specified NOT to match \. How does line 15 match the regex? – Lazer May 25 '11 at 17:21
  • @Lazer Sorry, got a case of tunnel vision. See my updated answer. – Greg Bacon May 25 '11 at 17:30
  • But even without anchors, we specified a +, which means at least one non-whitespace, non \ should be matched, what am I missing? – Lazer May 25 '11 at 17:38
  • Also, why do you have # ' at the end of the regex in your example? – Lazer May 25 '11 at 17:39
  • @Lazer The comments are there to help Stack Overflow's syntax highlighter recover from what it thinks is a quoted string in your pattern. See updated answer. – Greg Bacon May 25 '11 at 18:16
Are there sometimes arguments to the script? If not, why not:

/^run(?:\s.*\s|\s)(\S+)\s*$/

That one wrongly accepts the line-continuation case, though, because the trailing backslash counts as non-whitespace. A stricter version:

/^run(?:\s+cred=(?:[^'\s]*|'[^']*')\s+|\s+)([^\\\s]+)\s*$/
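
As a quick check on the simpler of the two patterns above, this sketch (the variable names are illustrative) runs it against the continuation line on its own:

```perl
use strict;
use warnings;

# The continuation line from the question, ending in a single backslash.
my $line = 'run cred=username/password \\';

# The simpler pattern: anything up to the last whitespace-separated token.
my ($captured) = $line =~ /^run(?:\s.*\s|\s)(\S+)\s*$/;

print((defined $captured) ? "captured [$captured]\n" : "no match\n");
```

It reports a match with the backslash itself as the "script name", which is why the second pattern excludes backslashes from the captured token.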

Test program:

#!/usr/bin/perl

use strict;
use warnings;

my $foo="# normal way
run cred=username/password script.bi

# single quoted username password, also separated in a different way
run cred='username password' script.bi

# username/password is optional
run script.bi

# script extension is optional
run script

# the call might be broken into multiple lines using \
# THIS ONE SHOULD NOT MATCH
run cred=username/password \\
script.bi
";

foreach my $line (split(/\n/,$foo))
{
  print "Looking >$line<\n";
  print "Match >$1<\n"
    if ($line =~ /^run(?:\s+cred=(?:[^'\s]*|'[^']*')\s+|\s+)([^\\\s]+)\s*$/);
}

Example output:

Looking ># normal way<
Looking >run cred=username/password script.bi<
Match >script.bi<
Looking ><
Looking ># single quoted username password, also separated in a different way<
Looking >run cred='username password' script.bi<
Match >script.bi<
Looking ><
Looking ># username/password is optional<
Looking >run script.bi<
Match >script.bi<
Looking ><
Looking ># script extension is optional<
Looking >run script<
Match >script<
Looking ><
Looking ># the call might be broken into multiple lines using <
Looking ># THIS ONE SHOULD NOT MATCH<
Looking >run cred=username/password \<
Looking >script.bi<
Seth Robertson
  • @Lazer: Note that your exemplar pattern will fail if there are quotes in the command name. The use of '' [^'] '' inside the quoted string is safer. – Seth Robertson May 25 '11 at 17:07
  • @kindahero: It does work, see the example output. Let me know what you think should have happened that did not. – Seth Robertson May 25 '11 at 20:42
  • in the last instance, the command is spanned in two lines(with \). output doesn't have any matches in that case. – kindahero May 25 '11 at 21:33
  • @kindahero: It isn't supposed to according to the specification. See the comment "THIS ONE SHOULD NOT MATCH"? Copied from the question above. – Seth Robertson May 26 '11 at 01:02