
I have a string like aa | bb | "cc | dd" | 'ee | ff' and I'm looking for a way to split it on the | character, with the exception of any | contained in quoted strings.

The idea is to get something like this: [aa, bb, "cc | dd", 'ee | ff']

I've already found an answer to a similar question here : https://stackoverflow.com/a/11457952/11260467

However, I can't find a way to adapt it to a case with multiple separator characters. Is there someone out there who is less dumb than me when it comes to regular expressions?

Xiidref
  • What do you mean by multiple separator characters? – testing_22 Oct 16 '21 at 19:15
  • I mean that the string should not be split if `|` is found between two `'` or `"` – Xiidref Oct 16 '21 at 19:32
  • [And an idea with `preg_split()`](https://tio.run/##K8go@P/fxj7AI4CLS6W4pEjBVkE9MVGhRiEpCUgoJScDyZQUJSAZo56aCqTS0mLU1a2BiotSi4GKC4pS0@OLC3IySzTU6zSilWLUYzX1tOxjDDW0gr09AzQ1tNw0a2KKtWJARJ26jgLIEk2g/oKizLyS@CINkDkg/v//AA) – bobble bubble Oct 16 '21 at 21:04

4 Answers


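This is easily done with the (*SKIP)(*FAIL) functionality that PCRE offers. The first alternative matches a quoted string and then discards it, so only the pipes outside quotes remain as split points: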

(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*

In PHP this could be:

<?php

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';

$splitted = preg_split($pattern, $string);
print_r($splitted);
?>

This would yield:

Array
(
    [0] => aa
    [1] => bb
    [2] => "cc | dd"
    [3] => 'ee | ff'
)

See a demo on regex101.com and on ideone.com.
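
If other separators are needed, the same pattern can be built around an arbitrary separator. The following is only a sketch: the function name split_outside_quotes and its parameters are made up for illustration, and it assumes a non-empty separator and quotes that are always balanced.

```php
<?php

/**
 * Hypothetical helper: split $string on $separator, ignoring separators
 * that appear inside single- or double-quoted substrings.
 * Assumes $separator is non-empty and quotes are balanced.
 */
function split_outside_quotes(string $string, string $separator): array
{
    // preg_quote() escapes regex metacharacters in the separator (such as |)
    // as well as the ~ delimiter used below.
    $sep = preg_quote($separator, '~');

    // Quoted substrings are matched first and thrown away with (*SKIP)(*FAIL);
    // what remains is the separator with optional surrounding whitespace.
    $pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*' . $sep . '\s*~';

    $parts = preg_split($pattern, $string);
    if ($parts === false) {
        throw new RuntimeException('preg_split() failed');
    }
    return $parts;
}

print_r(split_outside_quotes("aa | bb | \"cc | dd\" | 'ee | ff'", '|'));
print_r(split_outside_quotes('a, "b, c", d', ','));
```

The first call should give the same four elements as above; the second should keep "b, c" together as one element.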

Jan

This is easier if you match the parts instead of splitting. Patterns are greedy by default: they consume as many characters as possible. This allows you to define the more complex patterns for quoted strings before providing a pattern for an unquoted token:

$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';

$pattern = <<<'PATTERN'
(
    (?:[|[]|^) # after | or [ or string start
    \s*
    (?<token> # name the match
        "[^"]*" # string in double quotes
        |
        '[^']*'  # string in single quotes
        |
        [^\s|]+ # non-whitespace 
    )
    \s*
)x
PATTERN;

preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);

Output:

array(4) {
  [0]=>
  string(2) "aa"
  [1]=>
  string(2) "bb"
  [2]=>
  string(9) ""cc | dd""
  [3]=>
  string(9) "'ee | ff'"
}

Hints:

  1. The <<<'PATTERN' syntax is called nowdoc (like heredoc, but with the identifier in single quotes); it cuts down on escaping
  2. I use () as pattern delimiters - they are group 0
  3. Naming matches makes the code a lot more readable
  4. Modifier x allows you to indent and comment the pattern
ThW

Use

$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));

See PHP proof.

Results:

Array
(
    [0] => aa
    [1] => bb
    [2] => cc | dd
    [3] => ee | ff
)

EXPLANATION

--------------------------------------------------------------------------------
  (?|                      Branch reset group, does not capture:
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^\"]*                   any character except: '\"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    \"                       '"'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (                        group and capture to \1:
--------------------------------------------------------------------------------
      [^|'\"]+                 any character except: '|', ''', '\"'
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )                        end of \1
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \|                       '|'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \z                       the end of the string
--------------------------------------------------------------------------------
  )                        end of grouping
Ryszard Czech

It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to @Jan's answer.

(['"]).*?\1\K| *\| *

PCRE Demo

(['"]) # match a single or double quote and save to capture group 1
.*?    # match zero or more characters lazily
\1     # match the content of capture group 1
\K     # reset the starting point of the reported match and discard
       # any previously-consumed characters from the reported match
|      # or
\ *    # match zero or more spaces
\|     # match a pipe character
\ *    # match zero or more spaces

Notice that the part before the alternation character ("or") serves merely to move the engine's internal string pointer to just past the closing quote of a quoted substring.
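
In PHP this could be tried with preg_split(). This is only a sketch: the PREG_SPLIT_NO_EMPTY flag is my assumption, added because the \K branch reports an empty match right after each closing quote, and preg_split() would otherwise be expected to return empty strings at those positions.

```php
<?php

// The sample string from the question.
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";

// Same pattern as above, wrapped in ~ delimiters.
$pattern = '~([\'"]).*?\1\K| *\| *~';

// PREG_SPLIT_NO_EMPTY drops the empty pieces produced by the
// zero-length matches that \K leaves after each quoted substring.
$parts = preg_split($pattern, $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

With the flag set, this should give the four tokens aa, bb, "cc | dd" and 'ee | ff', with the quotes kept.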

Cary Swoveland