Perl split on spaces selectively

Question

I am trying to split a string on the spaces between elements in Perl. However, an element may itself contain spaces (either inside double quotes or enclosed within brackets).

For example, a string containing:

for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE

I would like to end up with an array like (hydrogen, helium, "carbon 14", "$(some stuff "here")", FILE)

I can deal with the "for element in" bit and get the rest as one string. I have tried doing

@elements = split /(?<=\"[^\"]*\")\s+(?=\"[^\"]*\")/, $list

and although the regex DOES match ONLY the whitespace between quotes (checked on regexr.com), the Perl program gives me "Lookbehind longer than 255 not implemented in regex".

Is there maybe a better way of using split on whitespace that would take this into account? What am I doing wrong with my regex?

Do we need to worry about bracket quoted things having `)` character inside them? (like `$(blah fyvg "fhgh)" fyyh)`) — pii_ke, Aug 01 '20 at 07:24

zdim · Answer 1 · 2020-08-01T09:07:23.930

Match either a quoted or a parenthesized expression, alternated with a sequence of non-space characters:

my @elems = $string =~ / ( "[^"]+" | \S*\( [^)]+ \)\S* | \S+ ) /gx;

Tested with your string, and some simple variations.

This assumes that neither delimiter nests: an expression between consecutive quotes is taken whole as one element (even if it has parenthesized subexpressions), and so is one inside parentheses (even if it has quoted segments). This is inferred from the question.

I've allowed a non-space sequence of characters preceding and following the parenthesis, to accommodate that $ in front. Adjust that if it can indeed only be a dollar in front.
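A minimal runnable sketch of the pattern above against the string from the question. Stripping the leading "for element in" with a substitution is an assumption made here for illustration; the question says that part is already handled some other way.

```perl
use strict;
use warnings;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

# Assumption for this sketch: drop the leading "for NAME in" so that
# only the element list is left.
( my $list = $string ) =~ s/\A for \s+ \S+ \s+ in \s+ //x;

# In list context with /g, the match returns every $1 it finds.
my @elems = $list =~ / ( "[^"]+" | \S*\( [^)]+ \)\S* | \S+ ) /gx;

print "[$_]\n" for @elems;
```

This should print five elements, with the quoted and parenthesized ones kept whole (inner spaces and quotes included).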

score 2 · Answer 2 · answered Aug 01 '20 at 19:08

In these situations I'd go for a parsing approach. That way you don't have to come up with a regex that does several different things. This is important as the complexity of the string changes. Even though this looks like more code, it's basic Perl and you can put it in a subroutine. I can easily add another token type without disturbing the mechanics of the code or rewriting the pattern. I also used this trick in How do I grab an unknown number of captures from a pattern?:

use v5.10;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

# The types of things you can match, going from most specific
# to least specific. Now you only need to describe what each
# individual thing looks like. Each pattern is responsible for
# the capture group $1, which is the thing we'll save.
my @patterns = (
    qr/ ( \$\( .+? \) ) /x,
    qr/ ( " .+? " )     /x,
    qr/ ( \S+ )         /x,
    );

my @tokens;
# The magic is global matching in scalar context,
# using /g. The \G anchor starts matching at the
# last position you matched in the prior match of
# the same string (that's in pos()). Normally that
# position is reset when a match fails, but /c
# prevents that so you can try other patterns. Once
# you match a pattern, save what you matched and
# move on.
#
# The pattern here also takes care of trailing whitespace.
while( pos($string) < length($string) ) {
    foreach my $pattern ( @patterns ) {
        next unless $string =~ m/ \G $pattern \s*/gcx;
        push @tokens, $1;
        last;
        }
    }

use Data::Dumper;
say Dumper( \@tokens );
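As a sketch of the "put it in a subroutine" and "add another token type" points, the same loop can be wrapped up as below. The name tokenize, the single-quoted token type, the leading-whitespace skip, and the bail-out when nothing matches are all assumptions added for illustration, not part of the code above.

```perl
use v5.10;
use strict;
use warnings;

# Hypothetical helper wrapping the \G and /gc loop shown above.
sub tokenize {
    my( $string ) = @_;

    my @patterns = (
        qr/ ( \$\( .+? \) ) /x,   # $( ... )
        qr/ ( " .+? " )     /x,   # "double quoted"
        qr/ ( ' .+? ' )     /x,   # 'single quoted' (the added token type)
        qr/ ( \S+ )         /x,   # anything else up to whitespace
        );

    my @tokens;

    # Skip leading whitespace so the first \G match starts on a token.
    # With /c, a failed match here leaves pos() alone.
    $string =~ m/ \G \s+ /gcx;

    TOKEN: while( ( pos($string) // 0 ) < length($string) ) {
        foreach my $pattern ( @patterns ) {
            next unless $string =~ m/ \G $pattern \s* /gcx;
            push @tokens, $1;
            next TOKEN;
            }
        last;   # guard: nothing matched here, so stop rather than loop forever
        }

    return @tokens;
    }

say "[$_]" for tokenize( q{hydrogen  "carbon  14"  $(some stuff "here")  'x y'  FILE} );
```

Adding the single-quoted type only took one more entry in @patterns; the loop itself did not change.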

You can do much the same with the branch reset operator, (?|...), so that each capture in the alternation is $1:

use v5.10;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

my @tokens = $string =~ m/
    (?|
        (?: ( \$ \( .+? \) ) ) |
        (?: ( " .+? "      ) ) |
        (?: ( \S+          ) )
    )
    /gx;

use Data::Dumper;
say Dumper( \@tokens );

These are a bit more complex than zdim's answer, but they're much more flexible. Say, for instance, that you decide that you don't want the quotes around "carbon 14". That's a very easy fix because the structure of the regex doesn't change. You only change the subpattern that deals with that token:

    (?|
        (?:   ( \$ \( .+? \) )   ) |
        (?: " ( .+?          ) " ) |
        (?:   ( \S+          )   )
    )
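For completeness, here is that quote-stripping fragment dropped into a full program, as a sketch using the same string as before. Only the second branch differs from the earlier branch reset version.

```perl
use v5.10;
use strict;
use warnings;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

# Every branch fills $1 because of (?|...). The quotes in the second
# branch sit outside the capture, so they are matched but not saved.
my @tokens = $string =~ m/
    (?|
        (?:   ( \$ \( .+? \) )   ) |
        (?: " ( .+?          ) " ) |
        (?:   ( \S+          )   )
    )
    /gx;

say "[$_]" for @tokens;
```

The "for element in" words come back as ordinary tokens here, so they still need to be dropped separately.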

You may not need this extra flexibility. I usually find that I run into additional weird situations in these sorts of tasks, so I start with the flexible solution. It's not a big deal after you've done it a couple times.

As for your error, you got:

Lookbehind longer than 255 not implemented in regex.

Before v5.30, you couldn't have a variable-width lookbehind at all. Since then it's been available as an experimental feature, but the regex compiler has to know beforehand that the lookbehind can't match more than 255 characters. Your pattern has (?<=\"[^\"]*\"), and that * means zero or more. The "more" is unbounded and can exceed 255, so it's an illegal pattern.
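A small sketch to see this for yourself. The helper name compiles is made up for illustration. The patterns are built from strings so that a failure happens at run time inside eval rather than while perl compiles the script. The bounded variant with {0,100} is an assumption about what would fit under the limit; whether it compiles depends on your Perl version, and some versions may warn that the feature is experimental.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the pattern string compiles, false otherwise.
sub compiles {
    my( $pattern ) = @_;
    return eval { qr/$pattern/; 1 } ? 1 : 0;
    }

my $unbounded = '(?<="[^"]*")\s+';        # lookbehind from the question
my $bounded   = '(?<="[^"]{0,100}")\s+';  # at most 102 characters wide

for my $pattern ( $unbounded, $bounded ) {
    printf "%-25s %s\n", $pattern,
        compiles( $pattern ) ? 'compiles' : 'does not compile';
    }
```

The unbounded form should be rejected on any Perl, with a message that differs between versions before and after v5.30.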

regexr.com uses PCRE, which stands for "Perl Compatible Regular Expressions", but the two engines have diverged enough that a pattern that appears to work there, or in another language, may not work in Perl. This usually isn't a problem, but lookbehinds are one of the differences.
