Regex split string on a char with exception for inner-string

Question

I have a string like `aa | bb | "cc | dd" | 'ee | ff'` and I'm looking for a way to split it on the | character, except where a | is contained in a quoted string.

The idea is to get something like this: [aa, bb, "cc | dd", 'ee | ff']

I've already found an answer to a similar question here : https://stackoverflow.com/a/11457952/11260467

However, I can't find a way to adapt it to a case with multiple separator characters. Is there someone out here who is less dumb than me when it comes to regular expressions?

I mean that the string should not be splitted if `|` is found between two `'` or `"` — Xiidref, Oct 16 '21 at 19:32
[And an idea with `preg_split()`](https://tio.run/##K8go@P/fxj7AI4CLS6W4pEjBVkE9MVGhRiEpCUgoJScDyZQUJSAZo56aCqTS0mLU1a2BiotSi4GKC4pS0@OLC3IySzTU6zSilWLUYzX1tOxjDDW0gr09AzQ1tNw0a2KKtWJARJ26jgLIEk2g/oKizLyS@CINkDkg/v//AA) — bobble bubble, Oct 16 '21 at 21:04

Jan · Accepted Answer · 2021-10-16T21:53:38.383

This is easily done with the (*SKIP)(*FAIL) functionality that PCRE offers:

(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*

In PHP this could be:

<?php

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';

$splitted = preg_split($pattern, $string);
print_r($splitted);
?>

This would yield:

Array
(
    [0] => aa
    [1] => bb
    [2] => "cc | dd"
    [3] => 'ee | ff'
)

See a demo on regex101.com and on ideone.com.
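
The same pattern can be wrapped in a small helper when the separator varies. The following is a minimal sketch, not part of the original answer: the function name `split_outside_quotes` and its `$separator` parameter are hypothetical, and it assumes that quotes are never escaped or nested inside a quoted value.

```php
<?php

// Hypothetical helper (name and signature are illustrative assumptions).
// Splits $input on $separator, ignoring separators inside '...' or "...".
function split_outside_quotes(string $input, string $separator = '|'): array
{
    // Escape the separator so characters such as | are taken literally.
    $sep = preg_quote($separator, '~');

    // First alternative: consume a whole quoted string, then (*SKIP)(*FAIL)
    // rejects it and makes the engine resume after the closing quote.
    // Second alternative: the actual split point, with surrounding whitespace.
    $pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*' . $sep . '\s*~';

    $parts = preg_split($pattern, $input);

    // preg_split() returns false on error; fall back to an empty list.
    return $parts === false ? [] : $parts;
}

print_r(split_outside_quotes("aa | bb | \"cc | dd\" | 'ee | ff'"));
print_r(split_outside_quotes("aa ; \"bb ; cc\" ; dd", ';'));
```

The second call shows the same idea with `;` as the separator; the quoted `"bb ; cc"` stays in one piece.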

score 3 · Answer 2 · answered Oct 16 '21 at 19:48

This is easier if you match the parts instead of splitting. Patterns are greedy by default: they consume as many characters as possible. This lets you define the more complex patterns for quoted strings before providing a pattern for an unquoted token:

$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';

$pattern = <<<'PATTERN'
(
    (?:[|[]|^) # after | or [ or string start
    \s*
    (?<token> # name the match
        "[^"]*" # string in double quotes
        |
        '[^']*'  # string in single quotes
        |
        [^\s|]+ # non-whitespace 
    )
    \s*
)x
PATTERN;

preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);

Output:

array(4) {
  [0]=>
  string(2) "aa"
  [1]=>
  string(2) "bb"
  [2]=>
  string(9) ""cc | dd""
  [3]=>
  string(9) "'ee | ff'"
}
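
The order of the alternatives inside the `token` group matters: PCRE tries them left to right and takes the first one that matches at the current position. The sketch below is not from the original answer; it uses simplified patterns and a hypothetical `tokens()` helper to show what happens when the unquoted branch comes first.

```php
<?php

// Hypothetical helper: returns every full match of $pattern in $subject.
function tokens(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches);
    return $matches[0];
}

$subject = 'aa | "cc | dd"';

// Quoted branch first: the quoted string is kept as one token.
$quotedFirst = tokens('~"[^"]*"|[^\s|]+~', $subject);

// Unquoted branch first: the quote is just another non-space character,
// so the quoted string is cut at the inner whitespace and pipe.
$unquotedFirst = tokens('~[^\s|]+|"[^"]*"~', $subject);

var_dump($quotedFirst);   // aa, "cc | dd"
var_dump($unquotedFirst); // aa, "cc, dd"
```

This is why the quoted alternatives are listed before `[^\s|]+` in the pattern above.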

Hints:

The <<<'PATTERN' syntax is called NOWDOC (a heredoc with a single-quoted identifier); nothing inside it is interpolated or unescaped, which cuts down on escaping
I use () as pattern delimiters - they are group 0
Naming matches makes code a lot more readable
Modifier x allows indenting and commenting the pattern

score 2 · Answer 3 · answered Oct 16 '21 at 21:04

Use

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));

See PHP proof.

Results:

Array
(
    [0] => aa
    [1] => bb
    [2] => cc | dd
    [3] => ee | ff
)

EXPLANATION

--------------------------------------------------------------------------------
  (?|                      Branch reset group, does not capture:
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^\"]*                   any character except: '\"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^|'\"]+                 any character except: '|', ''', '\"'
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \|                       '|'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of grouping
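
To see what the branch reset group `(?|...)` buys you, compare it with plain alternation. This is an illustrative sketch with shortened patterns, not code from the answer; the variable names `$plain` and `$reset` are made up for the comparison.

```php
<?php

$subject = "'ee | ff'";

// Plain alternation: each alternative has its own group number,
// so the single-quoted content lands in group 2 and group 1 is empty.
preg_match('~"([^"]*)"|\'([^\']*)\'~', $subject, $plain);

// Branch reset: every alternative numbers its groups from 1 again,
// so the content is always in group 1, whichever branch matched.
preg_match('~(?|"([^"]*)"|\'([^\']*)\')~', $subject, $reset);

var_dump($plain[1], $plain[2]); // "" and "ee | ff"
var_dump($reset[1]);            // "ee | ff"
```

With the branch reset, the calling code can read `$matches[1]` without checking which alternative matched.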

Your approach is the way to go IMO. But to avoid to trim the result, I would change the third branch to [that](https://3v4l.org/j4O2X). — Casimir et Hippolyte, Oct 17 '21 at 13:30

Cary Swoveland · Answer 4 · 2021-10-17T01:13:10.780

It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to @Jan's answer.

(['"]).*?\1\K| *\| *

PCRE Demo

(['"]) # match a single or double quote and save to capture group 1
.*?    # match zero or more characters lazily
\1     # match the content of capture group 1
\K     # reset the starting point of the reported match and discard
       # any previously-consumed characters from the reported match
|      # or
\ *    # match zero or more spaces
\|     # match a pipe character
\ *    # match zero or more spaces

Notice that the part before the pipe character ("or") serves merely to move the engine's internal string pointer to just past the closing quote of a quoted substring.
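
The answer gives only the pattern. One way to apply it in PHP, sketched here as an assumption rather than as the answer's own code, is `preg_split()` with the `PREG_SPLIT_NO_EMPTY` flag: the `\K` branch produces zero-length matches right after each closing quote, and those would otherwise leave empty strings in the result.

```php
<?php

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

// Left branch: walk over a quoted substring, then \K resets the match start,
// leaving a zero-length match just past the closing quote.
// Right branch: the real separator, a pipe with optional spaces around it.
$pattern = '~([\'"]).*?\1\K| *\| *~';

// PREG_SPLIT_NO_EMPTY drops the empty pieces caused by the zero-length matches.
$parts = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

Note that the flag also removes genuinely empty fields (for example `aa || bb`), so this variant only suits input where empty values do not need to be preserved.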
