In PHP I have the following string:

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO"; 

I need to split this string into the following parts:

AAA
BBB
(CCC,DDD)
'EEE'
'FFF,GGG'
('HHH','III')
(('JJJ','KKK'), LLL, (MMM,NNN))
OOO

I tried several regexes, but couldn't find a solution. Any ideas?

UPDATE

I've decided that a regex is not really the best solution when dealing with malformed data, escaped quotes, etc.

Thanks to the suggestions made here, I found a function that parses the string character by character, which I rewrote to suit my needs. It can handle different kinds of brackets, and the separator and quote characters are parameters as well.

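  function explode_brackets($str, $separator=",", $leftbracket="(", $rightbracket=")", $quote="'", $ignore_escaped_quotes=true) {

    $buffer = '';            // characters of the token being built
    $stack = array();        // finished tokens
    $depth = 0;              // current bracket nesting level
    $betweenquotes = false;  // true while inside a quoted string
    $char = '';              // initialised so $previouschar is defined on the first pass
    $len = strlen($str);
    for ($i=0; $i<$len; $i++) {
      $previouschar = $char;
      $char = $str[$i];
      switch ($char) {
        case $separator:
          // A separator only ends a token at top level and outside quotes.
          if (!$betweenquotes) {
            if (!$depth) {
              if ($buffer !== '') {
                $stack[] = $buffer;
                $buffer = '';
              }
              continue 2; // skip the separator itself
            }
          }
          break;
        case $quote:
          if ($ignore_escaped_quotes) {
            // A quote preceded by a backslash does not toggle the quoted state.
            if ($previouschar!="\\") {
              $betweenquotes = !$betweenquotes;
            }
          } else {
            $betweenquotes = !$betweenquotes;
          }
          break;
        case $leftbracket:
          if (!$betweenquotes) {
            $depth++;
          }
          break;
        case $rightbracket:
          if (!$betweenquotes) {
            if ($depth) {
              $depth--;
            } else {
              // Unmatched closing bracket: emit what we have as its own token.
              $stack[] = $buffer.$char;
              $buffer = '';
              continue 2;
            }
          }
          break;
      }
      $buffer .= $char;
    }
    // Flush the last token, if any.
    if ($buffer !== '') {
      $stack[] = $buffer;
    }

    return $stack;
  }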
Dylan

2 Answers

Instead of a preg_split, do a preg_match_all:

$str = "AAA, BBB, (CCC,DDD), 'EEE', 'FFF,GGG', ('HHH','III'), (('JJJ','KKK'), LLL, (MMM,NNN)) , OOO"; 

preg_match_all("/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/", $str, $matches);

print_r($matches);

will print:

Array
(
    [0] => Array
        (
            [0] => AAA
            [1] => BBB
            [2] => (CCC,DDD)
            [3] => 'EEE'
            [4] => 'FFF,GGG'
            [5] => ('HHH','III')
            [6] => (('JJJ','KKK'), LLL, (MMM,NNN))
            [7] => OOO
        )

)

The regex \((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+ can be divided into three parts:

  1. \((?:[^()]|(?R))+\), which matches balanced pairs of parentheses
  2. '[^']*', which matches a quoted string
  3. [^(),\s]+, which matches any sequence of characters that are not '(', ')', ',' or whitespace
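
To see the three alternatives working together, here is a minimal sketch that wraps the pattern in a small helper. The name split_top_level is hypothetical and only used for illustration; the pattern itself is the one from this answer, and the scalar type hints assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not part of the original answer): returns only the
// full matches, i.e. the top-level tokens found by the pattern.
function split_top_level(string $str): array
{
    // 1. balanced parentheses, recursing into the whole pattern via (?R)
    // 2. a single-quoted string
    // 3. a run of characters that are not '(', ')', ',' or whitespace
    $pattern = "/\((?:[^()]|(?R))+\)|'[^']*'|[^(),\s]+/";
    preg_match_all($pattern, $str, $matches);
    return $matches[0];
}

$parts = split_top_level("AAA, (CCC,DDD), 'FFF,GGG', ((X,Y), Z)");
print_r($parts);
```

Note that this only extracts whatever matches; anything that does not fit one of the three alternatives is silently skipped rather than reported.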
Bart Kiers
  • While you can match, it generally provides no guarantee when it is run against a bad input string. – nhahtdh Mar 05 '13 at 21:34
  • Hi Bart, thanks a lot. Could you think of any way to make 'FFF,GGG' appear as 1 match? – Dylan Mar 05 '13 at 22:17
  • Thanks again, it works great now, so I'll accept your answer as the right one. But I still decided to use parsing in my project instead, because of the possibility of malformed input data and escaped quotes, see my update of the question. – Dylan Mar 06 '13 at 18:49
  • @Dylan: My solution is resistant against malformed input data, and can be modified to work with escaped quotes. But then again, it is not easily maintainable without deep regex knowledge, and cannot point out where exactly the syntax error is (it knows that the error is somewhere ahead, but not exactly where). Manual parsing is better in such cases. – nhahtdh Mar 06 '13 at 22:50
  • @BartKiers This answer looks great for my use case, but it doesn't work for me. Can you please help me out with this at http://stackoverflow.com/questions/37183910/explode-arrays-in-php-excluding-the-within-braces – jitendrapurohit May 12 '16 at 11:41

Crazy solution

A spartan regex that tokenizes and also validates all the tokens that it extracts:

\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()',\s]++\s*+(?(?!\)),)|\s*+'[^'\r\n]*+'\s*+(?(?!\)),))++\))|[^()',\s]++|'[^'\r\n]*+')\s*+(?:,|$)

Regex101

Put it in a PHP string literal, with delimiters:

'/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/'

ideone

The result is in capturing group 1. In the example on ideone, I specify the PREG_OFFSET_CAPTURE flag so that you can check, against the last match in group 0 (the entire match), whether the entire source string has been consumed.
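
The consumption check described above could be sketched roughly as follows. This is not the code from the ideone example; the helper name tokenize_strict is hypothetical, and the sketch assumes PHP 7.1 or later (nullable return type, array destructuring). The pattern is the string literal given above, unchanged.

```php
<?php
// Hypothetical helper: returns the tokens from capturing group 1, or null
// when the pattern stops matching before the end of the input.
function tokenize_strict(string $str): ?array
{
    $pattern = '/\G\s*+((\((?:\s*+(?2)\s*+(?(?!\)),)|\s*+[^()\',\s]++\s*+(?(?!\)),)|\s*+\'[^\'\r\n]*+\'\s*+(?(?!\)),))++\))|[^()\',\s]++|\'[^\'\r\n]*+\')\s*+(?:,|$)/';

    $count = preg_match_all($pattern, $str, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    if (!$count) {
        return null; // no token matched at all (or a regex error occurred)
    }

    // Group 0 of the last match is [matched text, byte offset].
    [$text, $offset] = $matches[$count - 1][0];
    if ($offset + strlen($text) !== strlen($str)) {
        return null; // matching stopped early: the rest is malformed
    }

    // Collect the text of capturing group 1 from every match.
    $tokens = [];
    foreach ($matches as $m) {
        $tokens[] = $m[1][0];
    }
    return $tokens;
}

var_dump(tokenize_strict("AAA, (B, 'c d'), 'E'")); // well-formed input
var_dump(tokenize_strict("AAA, (B, C"));           // missing closing bracket
```

Because of the \G anchor, each match has to start exactly where the previous one ended, so a gap between the end of the last match and the end of the string means some part of the input failed validation.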

Assumptions

  • Non-quoted text may not contain any whitespace character, as defined by \s. Consequently, it may not span multiple lines.
  • Non-quoted text may not contain (, ), ' or ,.
  • Non-quoted text must contain at least 1 character.
  • Single-quoted text may not span multiple lines.
  • Single-quoted text may not contain a quote. Consequently, there is no way to include a literal ' in it.
  • Single-quoted text may be empty.
  • A bracket token contains one or more of the following as sub-tokens: a non-quoted text token, a single-quoted text token, or another bracket token.
  • In a bracket token, two adjacent sub-tokens are separated by exactly one comma ,.
  • A bracket token starts with ( and ends with ).
  • Consequently, a bracket token must have balanced brackets, and an empty bracket pair () is not allowed.
  • The input will contain one or more of: non-quoted text, single-quoted text or a bracket token. The tokens in the input are separated by a comma ,. A single trailing comma , is considered valid.
  • Whitespace characters (as defined by \s, which includes newline characters) are allowed in any amount between tokens, the commas separating tokens, and the brackets ( and ) of bracket tokens.

Breakdown

\G\s*+
(
  (
    \(
    (?:
        \s*+
        (?2)
        \s*+
        (?(?!\)),)
      |
        \s*+
        [^()',\s]++
        \s*+
        (?(?!\)),)
      |
        \s*+
        '[^'\r\n]*+'
        \s*+
        (?(?!\)),)
    )++
    \)
  )
  |
  [^()',\s]++
  |
  '[^'\r\n]*+'
)
\s*+(?:,|$)
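
As a side note, the breakdown above can be run almost as written by using PCRE's extended mode (the x modifier), which ignores whitespace in the pattern and allows # comments. The following is only a sketch of that idea: the ~ delimiter, the nowdoc string and the comments are additions for illustration, not part of the original answer.

```php
<?php
// Assumption: same pattern as in the breakdown, with only the "x" modifier,
// the "~" delimiter and the comments added. A nowdoc avoids any escaping.
$pattern = <<<'REGEX'
~
\G\s*+                     # continue where the previous match ended
(                          # group 1: one top-level token
  (                        # group 2: a bracket token
    \(
    (?:
        \s*+ (?2)         \s*+ (?(?!\)),)   # nested bracket token
      |
        \s*+ [^()',\s]++  \s*+ (?(?!\)),)   # non-quoted text
      |
        \s*+ '[^'\r\n]*+' \s*+ (?(?!\)),)   # single-quoted text
    )++
    \)
  )
  |
  [^()',\s]++              # non-quoted text
  |
  '[^'\r\n]*+'             # single-quoted text
)
\s*+ (?:,|$)               # separator or end of input
~x
REGEX;

preg_match_all($pattern, "X, ((Y, Z), 'q')", $matches);
print_r($matches[1]); // tokens captured by group 1
```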
nhahtdh