7

Two days ago I started working on a code parser and I'm stuck.

How can I split a string on commas that are not inside brackets? Let me show you what I mean.

I have this string to parse:

one, two, three, (four, (five, six), (ten)), seven

I would like to get this result:

array(
 "one"; 
 "two"; 
 "three"; 
 "(four, (five, six), (ten))"; 
 "seven"
)

but instead I get:

array(
  "one"; 
  "two"; 
  "three"; 
  "(four"; 
  "(five"; 
  "six)"; 
  "(ten))";
  "seven"
)

How can I do this with a regular expression in PHP?

Thank you in advance!

Cristian Toma

8 Answers

13

You can do that more easily:

preg_match_all('/[^(,\s]+|\([^)]+\)/', $str, $matches);

But it would be better to use a real parser, maybe something like this:

$str = 'one, two, three, (four, (five, six), (ten)), seven';
$buffer = '';
$stack = array();
$depth = 0;
$len = strlen($str);
for ($i=0; $i<$len; $i++) {
    $char = $str[$i];
    switch ($char) {
    case '(':
        $depth++;
        break;
    case ',':
        if (!$depth) {
            if ($buffer !== '') {
                $stack[] = $buffer;
                $buffer = '';
            }
            continue 2;
        }
        break;
    case ' ':
        if (!$depth) {
            continue 2;
        }
        break;
    case ')':
        if ($depth) {
            $depth--;
        } else {
            $stack[] = $buffer.$char;
            $buffer = '';
            continue 2;
        }
        break;
    }
    $buffer .= $char;
}
if ($buffer !== '') {
    $stack[] = $buffer;
}
var_dump($stack);
Gumbo
  • Yes, it's easier, but doesn't work in case of nested brackets, like so: one, two, three, (four, (five, six), (ten)), seven – Cristian Toma Jul 06 '09 at 07:41
  • That’s the point where you have to use a real parser. Regular expressions cannot count or handle states. – Gumbo Jul 06 '09 at 07:49
  • I have to use regular expressions. Regular expressions are recursive and greedy, you can accomplish this using them. – Cristian Toma Jul 06 '09 at 07:52
  • No, you can’t. Sure, there are features in modern implementations that can accomplish that, such as .NET’s *Balancing group* `(? … )` http://msdn.microsoft.com/bs2twtah.aspx. But they use a state machine, and that’s no longer a regular expression in the classical sense. – Gumbo Jul 06 '09 at 08:18
  • This one is more correct, but it still doesn't work for nested parentheses: /[^(,]*(?:\([^)]+\))?[^),]*/ – DarkSide Mar 24 '13 at 23:09
6

Hm... OK, this is already marked as answered, but since you asked for an easy solution I will try anyway:

$test = "one, two, three, , , ,(four, five, six), seven, (eight, nine)";
$split = "/([(].*?[)])|(\w)+/";
preg_match_all($split, $test, $out);
print_r($out[0]);              

Output

Array
(
    [0] => one
    [1] => two
    [2] => three
    [3] => (four, five, six)
    [4] => seven
    [5] => (eight, nine)
)
Sjon
merkuro
  • Thank you very much, your help is much appreciated. But now I realize that I will also encounter nested brackets and your solution doesn't apply. – Cristian Toma Jul 06 '09 at 07:43
4

You can't, directly. You'd need, at minimum, variable-width lookbehind, and last I knew PHP's PCRE only has fixed-width lookbehind.

My first recommendation would be to first extract parenthesized expressions from the string. I don't know anything about your actual problem, though, so I don't know if that will be feasible.

chaos
  • Yes, that was the hack I was planning to use. Replace the brackets with $1, $2 or something similar, split the string and then restore the brackets in the result. Thank you! – Cristian Toma Jul 05 '09 at 20:48
  • The point is that what you describe is not a regular language, so regular expressions are an ill fit. So, parsing out all the nested parts first is not a "hack" but the most sensible thing to do. – Svante Jul 06 '09 at 08:30
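A rough sketch of the extract-first idea from this answer and its comments, extended to nested brackets: innermost groups are swapped for tokens over and over until no brackets remain, the string is split on commas, and the tokens are expanded again. The function name split_with_placeholders and the \x01 token delimiter are assumptions made for illustration, and the input is assumed to contain balanced brackets and no \x01 bytes.

```php
<?php
// Hypothetical helper; the name and the \x01 token format are illustrative assumptions.
function split_with_placeholders(string $input): array
{
    $stash = [];
    // Replace innermost "(...)" groups with tokens until no group is left.
    do {
        $input = preg_replace_callback(
            '/\([^()]*\)/',
            function (array $m) use (&$stash): string {
                $token = "\x01" . count($stash) . "\x01";
                $stash[$token] = $m[0];
                return $token;
            },
            $input,
            -1,
            $count
        );
    } while ($count > 0);

    // Expand tokens recursively, since a stashed group may itself contain tokens.
    $restore = function (string $s) use (&$restore, &$stash): string {
        return preg_replace_callback(
            '/\x01\d+\x01/',
            function (array $m) use (&$restore, &$stash): string {
                return $restore($stash[$m[0]]);
            },
            $s
        );
    };

    // No brackets remain at this point, so a plain explode() is safe.
    return array_map(
        function (string $field) use ($restore): string {
            return $restore(trim($field));
        },
        explode(',', $input)
    );
}

print_r(split_with_placeholders('one, two, three, (four, (five, six), (ten)), seven'));
```

Working from the inside out means each pass only ever sees flat groups, so the simple [^()]* pattern is enough and no recursive regex is needed.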
2

I can't think of a way to do it using a single regex, but it's quite easy to hack together something that works:

function process($data)
{
        $entries = array();
        $filteredData = $data;
        if (preg_match_all("/\(([^)]*)\)/", $data, $matches)) {
                $entries = $matches[0];
                $filteredData = preg_replace("/\(([^)]*)\)/", "-placeholder-", $data);
        }

        $arr = array_map("trim", explode(",", $filteredData));

        if (!$entries) {
                return $arr;
        }

        $j = 0;
        foreach ($arr as $i => $entry) {
                if ($entry != "-placeholder-") {
                        continue;
                }

                $arr[$i] = $entries[$j];
                $j++;
        }

        return $arr;
}

If you invoke it like this:

$data = "one, two, three, (four, five, six), seven, (eight, nine)";
print_r(process($data));

It outputs:

Array
(
    [0] => one
    [1] => two
    [2] => three
    [3] => (four, five, six)
    [4] => seven
    [5] => (eight, nine)
)
Emil H
  • Thank you very much, this should work. This was how I planned to do it first, but I thought that an easier way exists. – Cristian Toma Jul 05 '09 at 21:07
  • Your method cannot parse "one, two, three, ((five), (four(six))), seven, eight, nine". I think the correct RegEx would be a recursive one: /\(([^()]+|(?R))*\)/. – Cristian Toma Jul 06 '09 at 07:26
  • You didn't mention that it had to be able to parse recursive expressions back when I first wrote this answer, though. Still, others have definitely suggested better solutions since I wrote this. – Emil H Jul 06 '09 at 07:50
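The recursive pattern suggested in the comment above can be turned into a full splitter. The following is a minimal sketch, assuming balanced brackets; the function name split_top_level is made up for illustration, and the pattern uses PCRE's (?1) subroutine call rather than (?R) so that only the bracket group recurses.

```php
<?php
// Hypothetical helper name; not taken from any answer in this thread.
function split_top_level(string $input): array
{
    // Group 1 matches one balanced "(...)" run and calls itself through (?1).
    // A field is any mix of non-comma, non-bracket text and balanced groups.
    $pattern = '/(?:[^,()]++|(\((?:[^()]++|(?1))*+\)))+/';
    preg_match_all($pattern, $input, $matches);
    // Trim the padding around each field and drop fields that end up empty.
    return array_values(array_filter(array_map('trim', $matches[0]), 'strlen'));
}

print_r(split_top_level('one, two, three, (four, (five, six), (ten)), seven'));
print_r(split_top_level('one, two, three, ((five), (four(six))), seven, eight, nine'));
```

Note that this variant silently skips empty fields such as the ones in "one, , two", which may or may not be what a parser wants.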
2

Maybe a bit late, but I've made a solution without regex which also supports nesting inside brackets. Let me know what you think:

$str = "Some text, Some other text with ((95,3%) MSC)";
$arr = explode(",",$str);

$parts = [];
$currentPart = "";
$bracketsOpened = 0;
foreach ($arr as $part){
    $currentPart .= ($bracketsOpened > 0 ? ',' : '').$part;
    // Count every bracket in the part, so that "((a)" leaves one bracket open
    $bracketsOpened += substr_count($part, "(") - substr_count($part, ")");
    if (!$bracketsOpened){
        $parts[] = $currentPart;
        $currentPart = '';
    }
}

Gives me the output:

Array
(
    [0] => Some text
    [1] =>  Some other text with ((95,3%) MSC)
)
1

Clumsy, but it does the job...

<?php

function split_by_commas($string) {
  preg_match_all("/\(.+?\)/", $string, $result); 
  $problem_children = $result[0];
  $i = 0;
  $temp = array();
  foreach ($problem_children as $submatch) { 
    $marker = '__'.$i++.'__';
    $temp[$marker] = $submatch;
    $string   = str_replace($submatch, $marker, $string);  
  }
  $result = explode(",", $string);
  foreach ($result as $key => $item) {
    $item = trim($item);
    $result[$key] = isset($temp[$item])?$temp[$item]:$item;
  }
  return $result;
}


$test = "one, two, three, (four, five, six), seven, (eight, nine), ten";

print_r(split_by_commas($test));

?>
Dycey
1

I feel it's worth noting that you should avoid regular expressions whenever you can. To that end, on PHP 5.3+ you could use str_getcsv(). If you're working with files (or file streams), such as CSV files, then fgetcsv() might be what you need, and it has been available since PHP 4.

Lastly, I'm surprised nobody used preg_split(), or did it not work as needed?
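On the preg_split() question: one possible approach, sketched here under the assumption that brackets are balanced, is to let the pattern consume each balanced group and then discard it with PCRE's (*SKIP)(*FAIL) verbs, so that only commas outside brackets remain as delimiters. The function name split_outside_brackets is hypothetical.

```php
<?php
// Hypothetical helper name, used only for this illustration.
function split_outside_brackets(string $input): array
{
    // First alternative: match a balanced "(...)" group, then force a failure
    // and skip past it, so nothing inside the group can act as a delimiter.
    // Second alternative: a comma with optional surrounding whitespace.
    $pattern = '/(\((?:[^()]++|(?1))*+\))(*SKIP)(*FAIL)|\s*,\s*/';
    return preg_split($pattern, trim($input)) ?: [];
}

print_r(split_outside_brackets('one, two, three, (four, (five, six), (ten)), seven'));
```

Unlike a match-based approach, this keeps empty fields, which is the usual preg_split() behaviour unless PREG_SPLIT_NO_EMPTY is passed.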

ken
0

I am afraid that it could be very difficult to parse nested brackets like one, two, (three, (four, five)) with regular expressions alone.

MyKey_