In these situations I'd go for a parsing approach. That way you don't have to come up with a single regex that does several different things, which matters more as the strings get more complex. Even though this looks like more code, it's basic Perl and you can put it in a subroutine. I can easily add another token type without disturbing the mechanics of the code or rewriting the pattern. I also used this trick in "How do I grab an unknown number of captures from a pattern?":
use v5.10;
my $string = 'for element in hydrogen helium "carbon 14" $(some stuff "here") FILE';
# The types of things you can match, going from most specific
# to least specific. Now you only need to describe what each
# individual thing looks like. Each pattern is responsible for
# the capture group $1, which is the thing we'll save.
my @patterns = (
    qr/ ( \$\( .+? \) ) /x,
    qr/ ( " .+? " ) /x,
    qr/ ( \S+ ) /x,
);
my @tokens;
# The magic is global matching in scalar context,
# using /g. The \G anchor starts matching at the
# last position you matched in the prior match of
# the same string (that's in pos()). Normally that
# position is reset when a match fails, but /c
# prevents that so you can try other patterns. Once
# you match a pattern, save what you matched and
# move on.
#
# The pattern here also takes care of trailing whitespace.
while( pos($string) < length($string) ) {
    foreach my $pattern ( @patterns ) {
        next unless $string =~ m/ \G $pattern \s* /gcx;
        push @tokens, $1;
        last;
    }
}
use Data::Dumper;
say Dumper( \@tokens );
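Here's a minimal sketch of the same loop wrapped in a subroutine. The name tokenize and its argument order are my own choices for illustration, not something from a module. It also adds two guards that the loop above doesn't have: it skips leading whitespace, and it dies if no pattern matches at the current position, so a string the patterns can't handle doesn't loop forever.

```perl
use v5.10;
use strict;
use warnings;

# tokenize() is a hypothetical helper name. It takes the string and a
# list of compiled patterns, most specific first. Each pattern must
# capture the token in $1.
sub tokenize {
    my( $string, @patterns ) = @_;
    my @tokens;

    # Skip leading whitespace so the first token starts at pos().
    $string =~ m/ \G \s+ /gcx;

    TOKEN: while( ( pos($string) // 0 ) < length($string) ) {
        foreach my $pattern ( @patterns ) {
            next unless $string =~ m/ \G $pattern \s* /gcx;
            push @tokens, $1;
            next TOKEN;
        }
        # Nothing matched here, so stop instead of looping forever.
        die "No pattern matches at position " . ( pos($string) // 0 ) . "\n";
    }

    return @tokens;
}

my @patterns = (
    qr/ ( \$\( .+? \) ) /x,
    qr/ ( " .+? " )     /x,
    qr/ ( \S+ )         /x,
);

my @tokens = tokenize(
    'for element in hydrogen helium "carbon 14" $(some stuff "here") FILE',
    @patterns,
);
say for @tokens;
```

Adding a token type is then a matter of putting another qr// in the list at the right level of specificity.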
You can do much the same with the branch reset operator, (?|...), where the capture in each alternative is $1:
use v5.10;
my $string = 'for element in hydrogen helium "carbon 14" $(some stuff "here") FILE';
my @tokens = $string =~ m/
    (?|
        (?: ( \$ \( .+? \) ) ) |
        (?: ( " .+? " ) ) |
        (?: ( \S+ ) )
    )
/gx;
use Data::Dumper;
say Dumper( \@tokens );
These are a bit more complex than zdim's answer, but they're much more flexible. Say, for instance, that you decide that you don't want the quotes around "carbon 14". That's a very easy fix because the structure of the regex doesn't change. You only change the subpattern that deals with that token:
    (?|
        (?: ( \$ \( .+? \) ) ) |
        (?: " ( .+? ) " ) |
        (?: ( \S+ ) )
    )
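To check the effect of that change, here's a sketch of the complete program with the modified subpattern dropped in. Nothing else moves. The square brackets in the output are only there to make the token boundaries visible.

```perl
use v5.10;

my $string = 'for element in hydrogen helium "carbon 14" $(some stuff "here") FILE';

# Same structure as before. Only the middle alternative changed, so
# its capture group now sits inside the quotes.
my @tokens = $string =~ m/
    (?|
        (?: ( \$ \( .+? \) ) ) |
        (?: " ( .+? ) "      ) |
        (?: ( \S+ )          )
    )
    /gx;

say "[$_]" for @tokens;
```

The sixth token should now come back as carbon 14 without the quotes, while the quotes inside $(some stuff "here") stay because that token belongs to the first alternative.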
You may not need this extra flexibility. I usually find that I run into additional weird situations in these sorts of tasks, so I start with the flexible solution. It's not a big deal after you've done it a couple times.
As for your error, you got:
Lookbehind longer than 255 not implemented in regex.
Before v5.30, you couldn't have a variable-width lookbehind. Now it's an experimental feature, but the regex engine has to know beforehand that the lookbehind can't go over 255 characters. Your pattern has (?<=\"[^\"]*\"), and that * is zero or more. That "more" can be greater than 255, so it's an illegal pattern.
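If you want to see the limit for yourself, here's a small sketch that compiles two lookbehinds at run time and reports what happens. The second pattern, with the bounded {0,100} quantifier, is my own example rather than something from your code. I'd expect it to compile on v5.30 and later and to fail on older perls, where the error message is different too.

```perl
use v5.10;

# Compile each pattern at run time so a bad one dies inside eval
# instead of stopping the whole program at compile time.
my %compiled;
foreach my $source ( '(?<=\"[^\"]*\")', '(?<=\"[^\"]{0,100}\")' ) {
    $compiled{$source} = eval {
        no warnings;    # silence the experimental-feature warning
        my $re = qr/$source/;
        1;
    } ? 1 : 0;

    say $compiled{$source}
        ? "compiled: $source"
        : "failed:   $source\n    $@";
}
```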
regexr.com uses PCRE. The "PC" stands for "Perl Compatible", but the two have diverged enough that a pattern that works there, or in other languages, may not work in Perl. This usually isn't a problem, but lookbehinds are one of the differences.