Parsing command arguments in PHP

Question

Is there a native "PHP way" to parse command arguments from a string? For example, given the following string:

foo "bar \"baz\"" '\'quux\''

I'd like to create the following array:

array(3) {
  [0] =>
  string(3) "foo"
  [1] =>
  string(9) "bar "baz""
  [2] =>
  string(6) "'quux'"
}

I've already tried to leverage token_get_all(), but PHP's variable interpolation syntax (e.g. "foo ${bar} baz") pretty much rained on my parade.

I know full well that I could write my own parser. Command argument syntax is super simplistic, but if there's an existing native way to do it, I'd much prefer that over rolling my own.

EDIT: Please note that I am looking to parse the arguments from a string, NOT from the shell/command-line.

EDIT #2: Below is a more comprehensive example of the expected input -> output for arguments:

foo -> foo
"foo" -> foo
'foo' -> foo
"foo'foo" -> foo'foo
'foo"foo' -> foo"foo
"foo\"foo" -> foo"foo
'foo\'foo' -> foo'foo
"foo\foo" -> foo\foo
"foo\\foo" -> foo\foo
"foo foo" -> foo foo
'foo foo' -> foo foo

no as there's no regular separator in your string, couldn't you format it so there was? — , Jul 25 '13 at 03:42
@dagon I think he is examining a string that is a command ... not looking at arguments passed in. — Orangepill, Jul 25 '13 at 03:57
Yes, @Orangepill is correct. The command is inside a string. Sorry for the confusion. — FtDRbwLXw6, Jul 25 '13 at 04:00
[A quick fiddle](http://regex101.com/r/vS9vB8). You should be using group **0**. — HamZa, Aug 11 '13 at 20:29
@HamZa: I don't think it's possible to do this correctly with a regular expression. Your example fiddle fails at several test cases, including simple ones like `'foo"bar' "baz'boz"` which should capture `foo"bar` and `baz'boz`. — FtDRbwLXw6, Aug 12 '13 at 14:33
Sorry, maybe I didn't get what you're asking, but if I run this command `php foo.php "bar \"baz\"" "'quux'"` and do a `var_dump($argv);` in `foo.php` I get exactly what you want. I repeat, maybe I misunderstand :) — Sylter, Aug 13 '13 at 16:18
@Sylter: I added an edit to the bottom of my question which addresses the confusion. — FtDRbwLXw6, Aug 13 '13 at 16:25
no clue if it's really super-simplistic. your question, for example, does not show what happens with unquoted but escaped spaces. https://eval.in/private/120da2a46daf7e (scroll down for a list of some common test cases) — hakre, Aug 13 '13 at 21:51

score 12 · Answer 1 · edited May 23 '17 at 11:52

Regexes are quite powerful: (?s)(?<!\\)("|')(?:[^\\]|\\.)*?\1|\S+. So what does this expression mean?

(?s) : set the s modifier to match newlines with a dot .
(?<!\\) : negative lookbehind, check if there is no backslash preceding the next token
("|') : match a single or double quote and put it in group 1
(?:[^\\]|\\.)*? : match everything not \, or match \ with the immediately following (escaped) character
\1 : match what is matched in the first group
| : or
\S+ : match anything except whitespace one or more times.

The idea is to capture the opening quote in a group to remember whether it's a single or a double one. The negative lookbehind makes sure we don't start a match at an escaped quote. \1 then matches the closing quote of the same kind. Finally, an alternation matches any run of non-whitespace characters. This solution is handy and applies to almost any language/flavor that supports lookbehinds and backreferences. Of course, it expects the quotes to be closed. The results are found in group 0.

Let's implement it in PHP:

$string = <<<INPUT
foo "bar \"baz\"" '\'quux\''
'foo"bar' "baz'boz"
hello "regex

world\""
"escaped escape\\\\"
INPUT;

preg_match_all('#(?<!\\\\)("|\')(?:[^\\\\]|\\\\.)*?\1|\S+#s', $string, $matches);
print_r($matches[0]);

If you wonder why I used 4 backslashes, take a look at my previous answer.

Output

Array
(
    [0] => foo
    [1] => "bar \"baz\""
    [2] => '\'quux\''
    [3] => 'foo"bar'
    [4] => "baz'boz"
    [5] => hello
    [6] => "regex

world\""
    [7] => "escaped escape\\"
)


Removing the quotes

Quite simple using named groups and a simple loop:

preg_match_all('#(?<!\\\\)("|\')(?<escaped>(?:[^\\\\]|\\\\.)*?)\1|(?<unescaped>\S+)#s', $string, $matches, PREG_SET_ORDER);

$results = array();
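foreach($matches as $array){
   // 'unescaped' is only present when the \S+ branch matched; testing it first
   // avoids an undefined index for empty or "0" quoted arguments.
   if(isset($array['unescaped'])){
      $results[] = $array['unescaped'];
   }else{
      $results[] = $array['escaped'];
   }
}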
print_r($results);
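
The loop above strips the surrounding quotes but still leaves backslash escapes inside the quoted tokens (for example `bar \"baz\"`). A minimal sketch of a second pass follows. It assumes, as in the question's table, that a backslash is only special before the enclosing quote character or another backslash. `unescape_token` is a hypothetical helper name, not part of the answer, and the quote character would have to be taken from group 1 of each match.

```php
<?php
// Hypothetical helper: strip one level of backslash escaping from a quoted token.
// Assumption: a backslash is only special before the enclosing quote or another backslash.
function unescape_token(string $token, string $quote): string
{
    // Regex seen by PCRE: /\\(["\\])/ (with the given quote character in the class).
    // It matches a backslash followed by the quote or a backslash and keeps only that character.
    $pattern = '/\\\\([' . preg_quote($quote, '/') . '\\\\])/';

    return preg_replace($pattern, '$1', $token);
}

var_dump(unescape_token('bar \\"baz\\"', '"'));
var_dump(unescape_token('foo\\foo', '"'));
```

Doing both replacements in a single left-to-right pass avoids the ordering problems that can occur when the quote and the backslash are unescaped in two separate steps.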


@Wrikken Thanks for the feedback, I'm thinking of a solution. — HamZa, Aug 13 '13 at 20:02
`preg_match_all('#(?<!\\\\)("|\')(?:[^\\\\]|\\\\.)*?\1|\S+#s', $string, $matches);` — Wrikken, Aug 13 '13 at 20:11
The problem with this solution, as well as @ircmaxell's is that escape characters are being left in the parsed arguments. If you notice in my example in the question, escape characters are not in the resulting output. This is why I don't believe regexes (by themselves) are able to do the job. I would need to parse the resulting output again to properly remove the escape characters, in order to get the desired output. — FtDRbwLXw6, Aug 14 '13 at 18:22

Ja͢ck · Answer 2 · 2013-08-15T03:38:10.037

I've worked out the following expression to match the various enclosures and escapement:

$pattern = <<<REGEX
/
(?:
  " ((?:(?<=\\\\)"|[^"])*) "
|
  ' ((?:(?<=\\\\)'|[^'])*) '
|
  (\S+)
)
/x
REGEX;

preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

It matches:

1. Two double quotes, inside of which a double quote may be escaped
2. Same as #1 but for single quotes
3. Unquoted string

Afterwards, you need to (carefully) remove the escaped characters:

$args = array();
foreach ($matches as $match) {
    if (isset($match[3])) {
        $args[] = $match[3];
    } elseif (isset($match[2])) {
        $args[] = str_replace(['\\\'', '\\\\'], ["'", '\\'], $match[2]);
    } else {
        $args[] = str_replace(['\\"', '\\\\'], ['"', '\\'], $match[1]);
    }
}
print_r($args);

Update

For the fun of it, I've written a more formal parser, outlined below. It won't give you better performance; it's about three times slower than the regular expression, mostly due to its object-oriented nature. I suppose the advantage is more academic than practical:

class ArgvParser2 extends StringIterator
{
    const TOKEN_DOUBLE_QUOTE = '"';
    const TOKEN_SINGLE_QUOTE = "'";
    const TOKEN_SPACE = ' ';
    const TOKEN_ESCAPE = '\\';

    public function parse()
    {
        $this->rewind();

        $args = [];

        while ($this->valid()) {
            switch ($this->current()) {
                case self::TOKEN_DOUBLE_QUOTE:
                case self::TOKEN_SINGLE_QUOTE:
                    $args[] = $this->QUOTED($this->current());
                    break;

                case self::TOKEN_SPACE:
                    $this->next();
                    break;

                default:
                    $args[] = $this->UNQUOTED();
            }
        }

        return $args;
    }

    private function QUOTED($enclosure)
    {
        $this->next();
        $result = '';

        while ($this->valid()) {
            if ($this->current() == self::TOKEN_ESCAPE) {
                $this->next();
                if ($this->valid() && $this->current() == $enclosure) {
                    $result .= $enclosure;
                } elseif ($this->valid()) {
                    $result .= self::TOKEN_ESCAPE;
                    if ($this->current() != self::TOKEN_ESCAPE) {
                        $result .= $this->current();
                    }
                }
            } elseif ($this->current() == $enclosure) {
                $this->next();
                break;
            } else {
                $result .= $this->current();
            }
            $this->next();
        }

        return $result;
    }

    private function UNQUOTED()
    {
        $result = '';

        while ($this->valid()) {
            if ($this->current() == self::TOKEN_SPACE) {
                $this->next();
                break;
            } else {
                $result .= $this->current();
            }
            $this->next();
        }

        return $result;
    }

    public static function parseString($input)
    {
        $parser = new self($input);

        return $parser->parse();
    }
}

It's based on StringIterator to walk through the string one character at a time:

class StringIterator implements Iterator
{
    private $string;

    private $current;

    public function __construct($string)
    {
        $this->string = $string;
    }

    public function current()
    {
        return $this->string[$this->current];
    }

    public function next()
    {
        ++$this->current;
    }

    public function key()
    {
        return $this->current;
    }

    public function valid()
    {
        return $this->current < strlen($this->string);
    }

    public function rewind()
    {
        $this->current = 0;
    }
}
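
For comparison, the same character-by-character walk can also be written as one plain function without the iterator classes. This is only a sketch and not the answer's code. The name `parse_args` is hypothetical, and it assumes the rules from the question's table: arguments are separated by whitespace, and inside quotes a backslash is special only before the enclosing quote or another backslash.

```php
<?php
// Sketch only: a procedural take on the same character walk (not the author's code).
// Assumptions: whitespace separates arguments; inside quotes a backslash is special
// only before the enclosing quote or another backslash.
function parse_args(string $input): array
{
    $args = [];
    $len = strlen($input);
    $i = 0;

    while ($i < $len) {
        $ch = $input[$i];

        if (ctype_space($ch)) {            // skip separators
            $i++;
            continue;
        }

        $arg = '';
        if ($ch === '"' || $ch === "'") {  // quoted argument
            $quote = $ch;
            $i++;                          // step past the opening quote
            while ($i < $len && $input[$i] !== $quote) {
                $next = $i + 1 < $len ? $input[$i + 1] : '';
                if ($input[$i] === '\\' && ($next === $quote || $next === '\\')) {
                    $arg .= $next;         // keep the escaped character, drop the backslash
                    $i += 2;
                } else {
                    $arg .= $input[$i++];  // ordinary character, including a lone backslash
                }
            }
            $i++;                          // step past the closing quote, if there is one
        } else {                           // bare word: runs to the next whitespace
            while ($i < $len && !ctype_space($input[$i])) {
                $arg .= $input[$i++];
            }
        }
        $args[] = $arg;
    }

    return $args;
}

print_r(parse_args('foo "bar \"baz\"" \'\\\'quux\\\'\''));
```

Like the class above, this treats a quote inside a bare word as an ordinary character, and an unterminated quote simply runs to the end of the input rather than raising an error.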

+1 - This is actually the only answer that addresses removing of the escape characters. It failed on the `"foo\foo"` test case (it returned `foooo` instead of `foo\foo` or `foofoo`), but that's an easy fix. I suppose it's probably more performant than an FSM parser, too. — FtDRbwLXw6, Aug 14 '13 at 18:39
@drrcknlsn I've made the changes and tested against your input set :) — Ja͢ck, Aug 15 '13 at 02:36
@drrcknlsn I've expanded the answer with a formal parser; enjoy :) — Ja͢ck, Aug 15 '13 at 03:20

ircmaxell · Answer 3 · 2013-08-13T21:19:12.377

Well, you could also build this parser with a recursive regex:

$regex = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";

Now that's a bit long, so let's break it out:

$identifier = '[a-zA-Z0-9.-]+';
$doubleQuotedString = "\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"";
$singleQuotedString = "'([^'\\\\]+(?2)|\\\\.(?2)|)'";
$regex = "($identifier|$doubleQuotedString|$singleQuotedString)s";

So how does this work? Well, the identifier should be obvious...

The two quoted sub-patterns are basically the same, so let's look at the single quoted string:

'([^'\\\\]+(?2)|\\\\.(?2)|)'

Really, that's a quote character followed by a recursive sub-pattern, followed by an end quote.

The magic happens in the sub-pattern.

[^'\\\\]+(?2)

That part basically consumes any non-quote and non-escape character. We don't care about them, so eat them up. Then, if we encounter either a quote or a backslash, trigger an attempt to match the entire sub-pattern again.

\\\\.(?2)

If we can consume a backslash, then consume the next character (without caring what it is), and recurse again.

Finally, we have an empty component (if the escaped character is last, or if there's no escape character).

Running this on the test input @HamZa provided returns the same result:

array(8) {
  [0]=>
  string(3) "foo"
  [1]=>
  string(13) ""bar \"baz\"""
  [2]=>
  string(10) "'\'quux\''"
  [3]=>
  string(9) "'foo"bar'"
  [4]=>
  string(9) ""baz'boz""
  [5]=>
  string(5) "hello"
  [6]=>
  string(16) ""regex

world\"""
  [7]=>
  string(18) ""escaped escape\\""
}
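
The call that produces a dump like this is not shown in the answer. A minimal sketch, assuming `preg_match_all` with the full matches read from group 0, and using the question's shorter example string as input rather than the longer test input:

```php
<?php
// The recursive pattern from the answer, copied verbatim. The outer ( ) are the
// regex delimiters, so group 1 is the double-quoted body and group 2 the single-quoted one.
$regex = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";

// Assumed sample input: the example string from the question.
$input = 'foo "bar \"baz\"" \'\\\'quux\\\'\'';

preg_match_all($regex, $input, $matches);

// Group 0 holds the full tokens, with quotes and escape characters still in place.
var_dump($matches[0]);
```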

The main difference is efficiency. This pattern should backtrack less (since it's a recursive pattern, there should be next to no backtracking for a well-formed string), whereas the other regex is non-recursive and will backtrack on every single character (that's what the ? after the * forces: non-greedy pattern consumption).

For short inputs this doesn't matter. On the test case provided, they run within a few percent of each other (the margin of error is greater than the difference). But with a single long string with no escape sequences:

"with a really long escape sequence match that will force a large backtrack loop"

The difference is significant (100 runs):

Recursive: float(0.00030398368835449)
Backtracking: float(0.00055909156799316)

Of course, we can partially lose this advantage with a lot of escape sequences:

"This is \" A long string \" With a\lot \of \"escape \sequences"

Recursive: float(0.00040411949157715)
Backtracking: float(0.00045490264892578)

But note that the length still dominates. That's because the backtracker scales at O(n^2), where the recursive solution scales at O(n). However, since the recursive pattern always needs to recurse at least once, it's slower than the backtracking solution on short strings:

"1"

Recursive: float(0.0002598762512207)
Backtracking: float(0.00017595291137695)

The tradeoff appears to happen around 15 characters... But both are fast enough that it won't make a difference unless you're parsing several KB or MB of data... But it's worth discussing...

On sane inputs, it won't make a significant difference. But if you're matching more than a few hundred bytes, it may start to add up significantly...
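
The timing code behind these numbers is not shown. A rough sketch of how such a comparison could be reproduced, assuming `microtime(true)` around 100 `preg_match_all` calls per pattern. `time_pattern` is a hypothetical helper, and absolute numbers will differ between machines and PHP versions.

```php
<?php
// Hypothetical helper: total wall-clock seconds for $runs executions of a pattern.
function time_pattern(string $regex, string $subject, int $runs = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_match_all($regex, $subject, $matches);
    }

    return microtime(true) - $start;
}

// Both patterns are copied from the answers in this thread.
$recursive    = "([a-zA-Z0-9.-]+|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";
$backtracking = '#(?<!\\\\)("|\')(?:[^\\\\]|\\\\.)*?\1|\S+#s';
$subject      = '"with a really long escape sequence match that will force a large backtrack loop"';

printf("Recursive:    %.6f\n", time_pattern($recursive, $subject));
printf("Backtracking: %.6f\n", time_pattern($backtracking, $subject));
```

A single measurement like this is noisy, so repeating it several times and comparing the minimum of each is usually more informative than one run.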

Edit

If you need to handle arbitrary "bare words" (unquoted strings), then you can change the original regex to:

$regex = "([^\s'\"]\S*|\"([^\"\\\\]+(?1)|\\\\.(?1)|)\"|'([^'\\\\]+(?2)|\\\\.(?2)|)')s";

However, it really depends on your grammar and what you consider a command or not. I'd suggest formalizing the grammar you expect...

Well, good point about efficiency, however `bare\"arg` ? (in other words, I do not agree with `[a-zA-Z0-9.-]+`. Even if we account for more chars, my shell is fully utf-8 capable (whether that's a good choice to use is debatable of course, but it's possible). If you drop that from the end & add @HamZa's `|\S+` on the end... we may have a better one. — Wrikken, Aug 13 '13 at 21:07
@Wrikken: is `bare\"arg` valid? I would argue that should be a syntax error... But if you want to take shell rules, I would do `[^\s"']\S*` rather than just `\S+`... That way it must start with a non-white-space-or-quote character, and then can have whatever. But the grammar really should be defined better than that... — ircmaxell, Aug 13 '13 at 21:14
Environment matters, [according to this](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_02) _"The application shall quote the following characters..." & _"The various quoting mechanisms are the escape character, single-quotes, and double-quotes"_, which does not sound to me like this would be a syntax error. However `arg=öleböle` if you like, the problem with the original would still persist. `[^\s"']\S*` does indeed look the candidate to error out on unended quotes. How the OP wants to handle _invalid_ argument lists is a good question though. — Wrikken, Aug 13 '13 at 21:49
the `bare\"arg` is valid because you can use it to express that the `"` quote should not start quoting but should be a quote verbatim in that block. also `bare\ ark` is no syntax error as isn't `bare\\ark`. - @Wrikken thanks for the link! — hakre, Aug 13 '13 at 22:48
The grammar that I expect is "standard" shell argument syntax, where arguments are space-delimited, with the ability to quote arguments with spaces in them (using single/double quotes), and escape quotes within quoted arguments (using a backslash). I think the problem with most regex-based solutions (and, subsequently, why I don't think a pure regex solution exists) is that while they may be able to tokenize the arguments, they leave any escape characters in the tokens. This means I have to double parse the input so that, e.g. `"\\f\o\"o"` becomes `\fo"o` instead of `\\f\o\"o`. — FtDRbwLXw6, Aug 14 '13 at 18:17
@drrcknlsn: that's fair, but a simple call to `stripslashes()` on the result tokens will take care of that. So it shouldn't be the end of the world... — ircmaxell, Aug 14 '13 at 18:41
@ircmaxell: Oh, absolutely. I was using the term "problem" very loosely. Anyway, I think my question was confusing, as I was actually looking for a native parsing implementation in PHP that I could leverage, rather than how to write my own parser. I just figured that since it was such a common syntax, there might be something in PHP that could do the parsing for me. Given that many smarter PHP folks than I are resorting to custom regex parsers, I guess the answer is probably "no". I'll likely go with this parser, as the one I had was just a relatively slow FSM. — FtDRbwLXw6, Aug 14 '13 at 19:47

Baba · Answer 4 · 2013-08-15T15:02:43.347

You can simply use str_getcsv and do a little cosmetic surgery with stripslashes and trim.

Example :

$str =<<<DATA
"bar \"baz\"" '\'quux\''
"foo"
'foo'
"foo'foo"
'foo"foo'
"foo\"foo"
'foo\'foo'
"foo\foo"
"foo\\foo"
"foo foo"
'foo foo' "foo\\foo" \'quux\' \"baz\" "foo'foo"
DATA;


$str = explode("\n", $str);

foreach($str as $line) {
    $line = array_map("stripslashes",str_getcsv($line," "));
    print_r($line);
}

Output

Array
(
    [0] => bar "baz"
    [1] => ''quux''
)
Array
(
    [0] => foo
)
Array
(
    [0] => 'foo'
)
Array
(
    [0] => foo'foo
)
Array
(
    [0] => 'foo"foo'
)
Array
(
    [0] => foo"foo
)
Array
(
    [0] => 'foo'foo'
)
Array
(
    [0] => foooo
)
Array
(
    [0] => foofoo
)
Array
(
    [0] => foo foo
)
Array
(
    [0] => 'foo
    [1] => foo'
    [2] => foofoo
    [3] => 'quux'
    [4] => "baz"
    [5] => foo'foo
)

Caution

There is no such thing as a universal format for arguments; it is best to specify one particular format, and the easiest I have seen is CSV.

Example

 app.php arg1 "arg 2" "'arg 3'" > 4

Using CSV you can simply get this output

Array
(
    [0] => app.php
    [1] => arg1
    [2] => arg 2
    [3] => 'arg 3'
    [4] => >
    [5] => 4
)
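
The call that produces this output is not shown. A minimal sketch, assuming a space as the separator and the default double-quote enclosure. The escape character is passed explicitly because recent PHP versions deprecate relying on its default value.

```php
<?php
// Assumed call behind the output above: space as separator, " as enclosure.
$line = 'app.php arg1 "arg 2" "\'arg 3\'" > 4';

// The fourth argument is the escape character (a backslash, which is also the default).
$args = str_getcsv($line, ' ', '"', '\\');

print_r($args);
```

Passing `"'"` as the third argument instead would make single quotes the enclosure, but `str_getcsv` accepts only one enclosure character per call, so mixed quoting in one input is not covered.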

+1 - This is probably my favorite answer. I never thought to leverage `str_getcsv()` for this. The function already supports escape characters as the 4th argument, but it seems to be broken (it doesn't actually remove the escape character from the output). I found [this bug report about it](https://bugs.php.net/bug.php?id=55413). Anyway, I don't think I can accept this answer, because it doesn't support single quotes (`'`) as delimiters in the same input as double quotes (`"`), but I'm going to play with it for a while and see what can be done. — FtDRbwLXw6, Aug 15 '13 at 13:22
You can use a custom `enclosure` and `escape` character ... That is very easy to achieve ... but like I said, I would advise against a custom format; you should make it standard instead .... — Baba, Aug 15 '13 at 15:02

score 5 · Answer 5 · edited Jan 25 '21 at 19:11

If you want to follow the same parsing rules as the shell, there are some edge cases which I think aren't easy to cover with regular expressions, so you might want to write a method that does this:

$string = 'foo "bar \"baz\"" \'\\\'quux\\\'\'';
echo $string, "\n";
print_r(StringUtil::separate_quoted($string));

Output:

foo "bar \"baz\"" '\'quux\''
Array
(
    [0] => foo
    [1] => bar "baz"
    [2] => 'quux'
)

I guess this pretty much matches what you're looking for. The function used in the example can be configured for the escape character as well as for the quotes; you can even use brackets like [ ] to form a "quote" if you like.

To handle strings other than native byte-safe strings with one character per byte, you can pass an array instead of a string. The array needs to contain one character per value as a binary-safe string, e.g. pass Unicode in NFC form as UTF-8 with one code point per array value and this should do the job for Unicode.

score 2 · Answer 6 · answered Aug 09 '13 at 19:51

Since you asked for a native way to do this, and PHP doesn't provide any function that mirrors how $argv is created, you could work around the gap like this:

Create an executable PHP script foo.php :

<?php

// Skip this file name
array_shift( $argv );

// Output valid PHP code
echo 'return '. var_export( $argv, true ).';';

?>

And use it to retrieve the arguments, the way PHP would actually receive them if you ran $command:

function parseCommand( $command )
{
    return eval(
        shell_exec( "php foo.php ".$command )
    );
}


$command = <<<CMD
foo "bar \"baz\"" '\'quux\''
CMD;


$args = parseCommand( $command );

var_dump( $args );

Advantages :

Very simple code
Should be faster than any regular expression
100% close to PHP behavior

Drawbacks :

Requires execution privilege on the host
Shell exec + eval on the same $var, let's party! You have to trust the input, or do so much filtering that a simple regexp may be faster (I didn't dig deep into that).

+1 - This is a rather clever way to approach the problem, but the input is 100% user-provided, so I'm hesitant to leverage the shell for this. I'm not really concerned at all about performance; just trying not to solve a problem that's already been solved. It's looking like I may just need to roll my own parser. — FtDRbwLXw6, Aug 12 '13 at 14:37
Holy Security Vulnerability Batman! `$command = "something"; rm -Rf *` :-P — ircmaxell, Aug 13 '13 at 20:59

score 1 · Answer 7 · answered Aug 12 '13 at 21:00

I would recommend going another way. There is already a "standard" way of handling command line arguments. It's called getopt:

http://php.net/manual/en/function.getopt.php

I would suggest that you change your script to use getopt; then anyone using your script will be passing parameters in a way that is familiar to them and kind of "industry standard" instead of having to learn your way of doing things.

Answered by Zak.

Thank you for the answer, but this is not for arguments coming into PHP. If it were, I would just use `$argv`. I'm looking to extract these arguments out of a string, not from the shell. – FtDRbwLXw6 Aug 12 '13 at 21:20

score 0 · Answer 8 · edited May 23 '17 at 11:59

Based on HamZa's answer:

function parse_cli_args($cmd) {
    preg_match_all('#(?<!\\\\)("|\')(?<escaped>(?:[^\\\\]|\\\\.)*?)\1|(?<unescaped>\S+)#s', $cmd, $matches, PREG_SET_ORDER);
    $results = [];
    foreach($matches as $array){
        // 'unescaped' only exists when the \S+ branch matched, so test it first.
        $results[] = isset($array['unescaped']) ? $array['unescaped'] : $array['escaped'];
    }
    return $results;
}

Answered by mpen on Apr 15 '14 at 19:06.

score 0 · Answer 9 · edited Jan 26 '21 at 06:50

I wrote some packages for console interactions:

Link to the new package: weew/console-arguments

There is also a cli application scaffold built around that package: weew/console

Link to the cli output formatter: weew/console-formatter

Arguments parsing

There is a package that does the whole arguments parsing thing weew/console-arguments

Example:

$parser = new ArgumentsParser();
$args = $parser->parse('command:name arg1 arg2 --flag="custom \"value" -f="1+1=2" -vvv');

$args will be an array:

['command:name', 'arg1', 'arg2', '--flag', 'custom "value', '-f', '1+1=2', '-v', '-v', '-v']

Arguments can be grouped:

$args = $parser->group($args);

$args will become:

['arguments' => ['command:name', 'arg1', 'arg2'], 'options' => ['--flag' => 1, '-f' => 1, '-v' => 1], '--flag' => ['custom "value'], '-f' => ['1+1=2'], '-v' => []]

Note: These solutions are not native but might still be useful to some people.

score -1 · Answer 10 · edited May 23 '17 at 12:31

I suggest something like:

$str = <<<EOD
foo "bar \"baz\"" '\'quux\''
EOD;

$match = preg_split("/('(?:.*)(?<!\\\\)(?>\\\\\\\\)*'|\"(?:.*)(?<!\\\\)(?>\\\\\\\\)*\")/U", $str, -1, PREG_SPLIT_DELIM_CAPTURE);

var_dump(array_filter(array_map('trim', $match)));

With some assistance from: string to array, split by single and double quotes for the regexp

You still have to unescape the strings in the array after.

array(3) {
  [0]=>
  string(3) "foo"
  [1]=>
  string(13) ""bar \"baz\"""
  [3]=>
  string(10) "'\'quux\''"
}

But you get the picture.

score -1 · Answer 11 · answered Aug 10 '13 at 04:13

There really is no native function for parsing commands, to my knowledge. However, I have created a function which does the trick in plain PHP. By applying str_replace several times, you can convert the string into something that can be split into an array. I don't know how fast you consider fast, but when running the function 400 times, the slowest call took under 34 microseconds.

function get_array_from_commands($string) {
    /*
    **  Turns a command string into an array
    **  of fields through several passes of
    **  str_replace, until we have a single
    **  string to split using explode().
    **  Returns an array.
    */

    // replace escaped single quotes with their
    // HTML entity
    $string = str_replace("\'","&#x27;",$string);
    // Do the same with double quotes
    $string = str_replace("\\\"","&quot;",$string);
    // Now turn all remaining single quotes into double quotes
    $string = str_replace("'","\"",$string);
    // Turn " " into " so we don't replace it too many times
    $string = str_replace("\" \"","\"",$string);
    // Turn the remaining double quotes into @@@ or some other value
    $string = str_replace("\"","@@@",$string);
    // Explode by @@@ or value listed above
    $string = explode("@@@",$string);
    return $string;
}

This method is error-prone, and non-exhaustive (e.g. what if data contains the literal "@@@"?). I would much rather write a proper parser. — FtDRbwLXw6, Aug 12 '13 at 14:40
