
I've found several partial answers to this question, but none that cover all my needs...

I am trying to parse a user-generated string as if it were a series of PHP function arguments, to determine the number of arguments:

This string:

$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")

will be inserted as the arguments of a function:

function my_function( [insert string here] ){ ... }

I need to parse the string on the commas, taking into account single- and double-quotes, parentheses, and escaped quotes and parentheses to create an array:

array(4) {
  [0] => $arg1
  [1] => $arg2='ABC,DEF'
  [2] => $arg3="GHI\",JKL"
  [3] => $arg4=array(1,'2)',"3\"),")
}

Any help with a regular expression or parser function to accomplish this is appreciated!

Matt
  • http://stackoverflow.com/questions/17118032/regular-expression-to-extract-php-code-partially-array-definition/17134110#17134110 – hwnd May 23 '15 at 05:11

3 Answers


This problem can't be solved with a classical CSV tool, since more than one character can protect parts of the string. Using preg_split is possible, but the pattern would be very complicated and inefficient, so the best approach is preg_match_all. There are, however, several problems to solve:

  • a comma enclosed in quotes or parentheses must be ignored (treated as an ordinary character, not as a delimiter)
  • you need to extract the params, but you also need to check that the string is well-formed, otherwise the match results may be completely wrong!

For the first point, you can define subpatterns to describe each case: the quoted parts, the parts enclosed in parentheses, and a more general subpattern that matches a complete param and uses the two previous subpatterns when needed.

Note that the parentheses subpattern needs to refer to the general subpattern too, since it can contain anything (including commas).

The second point can be solved using the \G anchor, which ensures that all matches are contiguous. But you also need to be sure that the end of the string has been reached. To do that, you can add an optional empty capture group at the end of the main pattern that is set only if the end-of-string anchor \z succeeds.

$subject = <<<'EOD'
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
EOD;

$pattern = <<<'EOD'
~
  # named groups definitions
  (?(DEFINE) # this definition group lets you define the subpatterns you want
             # without matching anything
      (?<quotes>
          ' [^'\\]*+ (?s:\\.[^'\\]*)*+ ' | " [^"\\]*+ (?s:\\.[^"\\]*)*+ "
      )
      (?<brackets> \( \g<content> (?: ,+ \g<content> )*+ \) )
      (?<content> [^,'"()]*+        # ' # (<-- comment for SO syntax highlighting)
                  (?:
                      (?: \g<brackets> | \g<quotes> )
                      [^,'"()]*     # ' #
                  )*+
      )
  )
  # the main pattern
  (?: # two possible beginnings
      \G(?!\A) , # a comma contiguous to a previous match
    |            #  OR
      \A         # the start of the string
  ) 
  (?<param> \g<content> )
  (?: \z (?<check>) )? # create an item "check" when the end is reached
~x
EOD;

$result = false;

if ( preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER) &&
     isset(end($matches)['check']) )
    $result = array_map(function ($i) { return $i['param']; }, $matches);
else 
   echo 'bad format' . PHP_EOL;

var_dump($result);
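
Since the question also asks about a parser function, here is a minimal sketch of the same rules (ignore commas inside quotes and parentheses, honor backslash escapes inside quotes, reject unbalanced input) written as a hand-rolled character scanner instead of a regex. This is not the pattern above but a different technique; the function name `split_args` is hypothetical, the type hints assume PHP 7.1 or later, and backslashes outside quotes are treated as ordinary characters.

```php
<?php
// Hypothetical helper (not part of the answer above): split an argument
// string on top-level commas. Returns null when the input is malformed.
function split_args(string $input): ?array
{
    $parts = [];
    $current = '';
    $depth = 0;     // nesting level of parentheses outside quotes
    $quote = null;  // active quote character, or null when outside quotes
    $length = strlen($input);

    for ($i = 0; $i < $length; $i++) {
        $char = $input[$i];

        if ($quote !== null) {
            $current .= $char;
            if ($char === '\\' && $i + 1 < $length) {
                // Copy the escaped character verbatim so that \" or \'
                // cannot close the quoted part.
                $current .= $input[++$i];
            } elseif ($char === $quote) {
                $quote = null;
            }
            continue;
        }

        if ($char === "'" || $char === '"') {
            $quote = $char;
        } elseif ($char === '(') {
            $depth++;
        } elseif ($char === ')') {
            if (--$depth < 0) {
                return null; // closing parenthesis without an opener
            }
        } elseif ($char === ',' && $depth === 0) {
            // A top-level comma ends the current param.
            $parts[] = $current;
            $current = '';
            continue;
        }
        $current .= $char;
    }

    if ($quote !== null || $depth !== 0) {
        return null; // unterminated quote or parenthesis
    }
    $parts[] = $current;
    return $parts;
}

$subject = <<<'EOD'
$arg1,$arg2='ABC,DEF',$arg3="GHI\",JKL",$arg4=array(1,'2)',"3\"),")
EOD;

$parts = split_args($subject);
var_dump($parts);
```

Like the regex, this sketch allows empty params and does no trimming; the `null` return plays the role of the "bad format" branch.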


Casimir et Hippolyte
  • Very thorough solution. This also works if the ,$ delimiter falls inside quotes. I'm going to take some more time to digest it, but I did make one slight tweak to allow for potential whitespace surrounding commas. I changed this: \G(?!\A) , to this: \G(?!\A)\s*,\s* [demo](https://eval.in/369061) – Matt May 23 '15 at 12:34
  • @Matt: this pattern is a bit general and allows, for example, empty params, but feel free to be more explicit when needed. For example, you can change `(?<param> \g<content> )` to something like this: `(?<param> \$\w+ = \g<content> )` – Casimir et Hippolyte May 23 '15 at 12:52
  • @Matt: as for trimming the params, you can do that in `array_map` too, like this: `return trim($i['param']);` – Casimir et Hippolyte May 23 '15 at 12:54
  • @Matt: on reflection, changing to `\G(?!\A) \s*,\s*` is a bad idea because `\g<content>` is greedy, so it can't trim on the right this way. – Casimir et Hippolyte May 23 '15 at 12:58
0

You could split the argument string at ,$ and then prepend $ back onto the array values:

$args_array = explode(',$', $arg_str);
foreach($args_array as $key => $arg_raw) {
    $args_array[$key] = '$'.ltrim($arg_raw, '$');
}
print_r($args_array);

Output:

(
    [0] => $arg1
    [1] => $arg2='ABC,DEF'
    [2] => $arg3="GHI\",JKL"
    [3] => $arg4=array(1,'2)',"3\"),")
)
Ulver
  • Unfortunately, this fails if the delimiter is inside quotes: $arg="...,$..." would produce [0] => $arg="... and [1] => $..." – Matt May 23 '15 at 11:44
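
The failure described in this comment can be reproduced with a short script. The input string below is a made-up example (not from the question) with two arguments, where the first default value contains `,$` inside double quotes:

```php
<?php
// Hypothetical input: two arguments, with ",$" inside the quoted default.
$arg_str = '$a="x,$y",$b=1';

$args_array = explode(',$', $arg_str);
foreach ($args_array as $key => $arg_raw) {
    $args_array[$key] = '$' . ltrim($arg_raw, '$');
}

// Three pieces come back instead of two, because explode() knows
// nothing about quotes.
print_r($args_array);
```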
0

If you want to use a regex, you can use something like this:

(.+?)(?:,(?=\$)|$)

PHP code:

$re = '/(.+?)(?:,(?=\$)|$)/';
$str = "\$arg1,\$arg2='ABC,DEF',\$arg3=\"GHI\\\",JKL\",\$arg4=array(1,'2)',\"3\\\"),\")\n";

preg_match_all($re, $str, $matches);

Match information:

MATCH 1
1.  [0-5]   `$arg1`
MATCH 2
1.  [6-21]  `$arg2='ABC,DEF'`
MATCH 3
1.  [22-39] `$arg3="GHI\",JKL"`
MATCH 4
1.  [40-67] `$arg4=array(1,'2)',"3\"),")`
Federico Piazza
  • As in the solution provided by Ulver, this fails if the ,$ pattern falls inside a quoted pairing: $arg="...,$..." – Matt May 23 '15 at 11:59
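
To illustrate this comment, the pattern from the answer can be run against a made-up two-argument string (an assumption for demonstration, not taken from the question) whose first default value contains `,$` inside double quotes:

```php
<?php
$re = '/(.+?)(?:,(?=\$)|$)/';

// Hypothetical input: two arguments, with ",$" inside the quoted default.
$str = '$a="x,$y",$b=1';

preg_match_all($re, $str, $matches);

// Group 1 holds three captures instead of two, because the lookahead
// only checks for a following "$" and ignores quoting.
print_r($matches[1]);
```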