14

I have PHP code (an array definition) stored in a string like this:

$code=' array(

  0  => "a",
 "a" => $GlobalScopeVar,
 "b" => array("nested"=>array(1,2,3)),  
 "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },

); ';

Is there a regular expression to extract this array? I mean, I want something like:

$array = array(

  0  => '"a"',
 'a' => '$GlobalScopeVar',
 'b' => 'array("nested"=>array(1,2,3))',
 'c' => 'function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }',

);

PS: I did some research trying to find a regular expression, but found nothing.
PS2: Gods of Stack Overflow, let me put a bounty on this now and I will offer 400 :3
PS3: This will be used in an internal app, where I need to extract an array from some PHP file so it can be 'processed' in parts. I try to explain it in this example: codepad.org/td6LVVme

AgelessEssence

5 Answers

31

Regex

So here's the MEGA regex I came up with:

\s*                                     # white spaces
########################## KEYS START ##########################
(?:                                     # We\'ll use this to make keys optional
(?P<keys>                               # named group: keys
\d+                                     # match digits
|                                       # or
"(?(?=\\\\")..|[^"])*"                  # match string between "", works even 4 escaped ones "hello \" world"
|                                       # or
\'(?(?=\\\\\')..|[^\'])*\'              # match string between \'\', same as above :p
|                                       # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])*          # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
)                                       # close group: keys
########################## KEYS END ##########################
\s*                                     # white spaces
=>                                      # match =>
)?                                      # make keys optional
\s*                                     # white spaces
########################## VALUES START ##########################
(?P<values>                             # named group: values
\d+                                     # match digits
|                                       # or
"(?(?=\\\\")..|[^"])*"                  # match string between "", works even 4 escaped ones "hello \" world"
|                                       # or
\'(?(?=\\\\\')..|[^\'])*\'              # match string between \'\', same as above :p
|                                       # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])*          # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
|                                       # or
array\s*\((?:[^()]|(?R))*\)             # match an array()
|                                       # or
\[(?:[^[\]]|(?R))*\]                    # match an array, new PHP array syntax: [1, 3, 5] is the same as array(1,3,5)
|                                       # or
(?:function\s+)?\w+\s*                  # match functions: helloWorld, function name
(?:\((?:[^()]|(?R))*\))                 # match function parameters (wut), (), (array(1,2,4))
(?:(?:\s*use\s*\((?:[^()]|(?R))*\)\s*)? # match use(&$var), use($foo, $bar) (optionally)
\{(?:[^{}]|(?R))*\}                     # match { whatever}
)?;?                                    # match ; (optionally)
)                                       # close group: values
########################## VALUES END ##########################
\s*                                     # white spaces

I've added some comments. Note that you need to use 3 modifiers:
x : lets me add comments (whitespace in the pattern is ignored)
s : makes the dot match newlines
i : matches case-insensitively
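
As a small illustration of what these three modifiers do, here is a sketch with a made-up mini-pattern (it is not part of the regex above, and the subject string is an assumption chosen only for the demo):

```php
<?php
// Hypothetical mini-pattern, only to illustrate the x, s and i modifiers.
$pattern = '~
    array \s* \(   # x: whitespace and comments inside the pattern are ignored
    (.*)           # s: the dot also matches newlines
    \)
~xsi';             // i: "ARRAY(" matches as well as "array("

$subject = "ARRAY(1,\n2)";
preg_match($pattern, $subject, $m);

// The captured text spans the newline between "1," and "2".
var_dump($m[1] === "1,\n2"); // bool(true)
```

Without s the dot would stop at the newline, and without i the upper-case ARRAY would not match at all.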

PHP

$code='array(0  => "a", 123 => 123, $_POST["hello"][\'world\'] => array("is", "actually", "An array !"), 1234, \'got problem ?\', 
 "a" => $GlobalScopeVar, $test_further => function test($noway){echo "this works too !!!";}, "yellow" => "blue",
 "b" => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3)), "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
  "bug", "fixed", "mwahahahaa" => "Yeaaaah", "this is" => "a test"
);'; // Sample data

$code = preg_replace('#(^\s*array\s*\(\s*)|(\s*\)\s*;?\s*$)#s', '', $code); // Just to get rid of array( at the beginning, and ); at the end

preg_match_all('~
\s*                                     # white spaces
########################## KEYS START ##########################
(?:                                     # We\'ll use this to make keys optional
(?P<keys>                               # named group: keys
\d+                                     # match digits
|                                       # or
"(?(?=\\\\")..|[^"])*"                  # match string between "", works even 4 escaped ones "hello \" world"
|                                       # or
\'(?(?=\\\\\')..|[^\'])*\'              # match string between \'\', same as above :p
|                                       # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])*          # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
)                                       # close group: keys
########################## KEYS END ##########################
\s*                                     # white spaces
=>                                      # match =>
)?                                      # make keys optional
\s*                                     # white spaces
########################## VALUES START ##########################
(?P<values>                             # named group: values
\d+                                     # match digits
|                                       # or
"(?(?=\\\\")..|[^"])*"                  # match string between "", works even 4 escaped ones "hello \" world"
|                                       # or
\'(?(?=\\\\\')..|[^\'])*\'              # match string between \'\', same as above :p
|                                       # or
\$\w+(?:\[(?:[^[\]]|(?R))*\])*          # match variables $_POST, $var, $var["foo"], $var["foo"]["bar"], $foo[$bar["fail"]]
|                                       # or
array\s*\((?:[^()]|(?R))*\)             # match an array()
|                                       # or
\[(?:[^[\]]|(?R))*\]                    # match an array, new PHP array syntax: [1, 3, 5] is the same as array(1,3,5)
|                                       # or
(?:function\s+)?\w+\s*                  # match functions: helloWorld, function name
(?:\((?:[^()]|(?R))*\))                 # match function parameters (wut), (), (array(1,2,4))
(?:(?:\s*use\s*\((?:[^()]|(?R))*\)\s*)? # match use(&$var), use($foo, $bar) (optionally)
\{(?:[^{}]|(?R))*\}                     # match { whatever}
)?;?                                    # match ; (optionally)
)                                       # close group: values
########################## VALUES END ##########################
\s*                                     # white spaces
~xsi', $code, $m); // Matching :p

print_r($m['keys']); // Print keys
print_r($m['values']); // Print values


// Since some keys may be empty (when the array doesn't specify them), let's fill them in!
foreach($m['keys'] as $index => &$key){
    if($key === ''){
        $key = 'made_up_index_'.$index;
    }
}
unset($key); // break the reference left behind by the foreach
$results = array_combine($m['keys'], $m['values']);
print_r($results); // printing results

Output

Array
(
    [0] => 0
    [1] => 123
    [2] => $_POST["hello"]['world']
    [3] => 
    [4] => 
    [5] => "a"
    [6] => $test_further
    [7] => "yellow"
    [8] => "b"
    [9] => "c"
    [10] => 
    [11] => 
    [12] => "mwahahahaa"
    [13] => "this is"
)
Array
(
    [0] => "a"
    [1] => 123
    [2] => array("is", "actually", "An array !")
    [3] => 1234
    [4] => 'got problem ?'
    [5] => $GlobalScopeVar
    [6] => function test($noway){echo "this works too !!!";}
    [7] => "blue"
    [8] => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3))
    [9] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
    [10] => "bug"
    [11] => "fixed"
    [12] => "Yeaaaah"
    [13] => "a test"
)
Array
(
    [0] => "a"
    [123] => 123
    [$_POST["hello"]['world']] => array("is", "actually", "An array !")
    [made_up_index_3] => 1234
    [made_up_index_4] => 'got problem ?'
    ["a"] => $GlobalScopeVar
    [$test_further] => function test($noway){echo "this works too !!!";}
    ["yellow"] => "blue"
    ["b"] => array("nested"=>array(1,2,3), "nested"=>array(1,2,3),"nested"=>array(1,2,3))
    ["c"] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
    [made_up_index_10] => "bug"
    [made_up_index_11] => "fixed"
    ["mwahahahaa"] => "Yeaaaah"
    ["this is"] => "a test"
)

Online regex demo    Online php demo

Known bug (fixed)

    $code='array("aaa", "sdsd" => "dsdsd");'; // fail
    $code='array(\'aaa\', \'sdsd\' => "dsdsd");'; // fail
    $code='array("aaa", \'sdsd\' => "dsdsd");'; // succeed
    // Which means, if a value with no keys is followed
    // by key => value and they are using the same quotation
    // then it will fail (first value gets merged with the key)

Online bug demo

Credits

Goes to Bart Kiers for his recursive pattern to match nested brackets.
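
For readers who have not met recursive patterns before, here is a minimal sketch of the nested-bracket idea in isolation. It is a reduced form of the `\((?:[^()]|(?R))*\)` fragment used above; the subject string is made up for illustration.

```php
<?php
// (?R) re-enters the whole pattern, so every "(" must be closed by a
// matching ")" before the outer match can finish.
$pattern = '~\((?:[^()]|(?R))*\)~';

$subject = 'array(1, array(2, 3), foo(bar()))';
preg_match($pattern, $subject, $m);

echo $m[0], "\n"; // (1, array(2, 3), foo(bar()))
```

Note that in the big regex (?R) recurses into the entire key/value pattern rather than just the bracket part, so this sketch only shows the principle.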

Advice

You should probably go with a parser, since regexes are fragile. @bwoebi has done a great job in his answer.

HamZa
  • The OP doesn't want sub-arrays to be parsed. Look at his initial example: `"b" => array("nested"=>array(1,2,3)),` should result in `"b" => 'array("nested"=>array(1,2,3))',`. – bwoebi Jun 16 '13 at 16:12
  • @bwoebi It doesn't get parsed, I have provided an extended example. – HamZa Jun 16 '13 at 16:15
  • I tested it thoroughly and it should work, don't hesitate to report a bug if you find one. – HamZa Jun 17 '13 at 00:28
  • @HamZa is it intentional that you only support a certain set of the language? `${"variable"}` for example will fail. – bwoebi Jun 17 '13 at 18:22
  • @HamZa It also fails when you want to use namespaces: `\a\b()` – bwoebi Jun 17 '13 at 18:28
  • @bwoebi intentional or not, regex is about matching certain patterns. If I didn't write it to match `${"variable"}` (for example) then it won't match. Basically I tried my best matching all *possible* cases, but since I'm not an advanced php coder, I forgot some cases (which I didn't even know). – HamZa Jun 17 '13 at 20:24
  • @HamZa check out everything the language allows: http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_language_parser.y#282 – bwoebi Jun 17 '13 at 20:26
  • @Qtax Your regex fails or am I missing something [demo](http://regex101.com/r/eU2jR5) ? As mine works [great](http://regex101.com/r/bN6dW7). – HamZa Jun 18 '13 at 14:29
  • @HamZa: That is not how escapes work in PHP strings. The second example string is invalid. (The backslash escapes the backslash, the second double quote is unescaped and ends the string.) Thus my example works properly in this case, yours doesn't. Although the quantifier used in my example should be `*` instead of `+` (if you want to match empty strings too): `"(?:[^\\"]|\\.)*+"` – Qtax Jun 18 '13 at 14:33
  • @Qtax Well, at least it will match even if there are errors in the code? – HamZa Jun 18 '13 at 14:40
  • @HamZa, yes. And it matches wrong things when there are no errors in the code. -1. – Qtax Jun 18 '13 at 14:43
  • @Qtax "wrong things", can you elaborate ? If the code I'm trying to match has errors in it, and I'm trying to "bypass" these errors, is that bad ? Also if I changed my regex to what you have suggested it may completely break if there are errors in code. – HamZa Jun 18 '13 at 14:46
  • @Qtax My bad, it seems I copied the PHP code in the regex tester directly which is bad, note that in PHP I need to use \\\\ to match \ while in the tester I need to only use \\: [demo](http://regex101.com/r/lB7xO8). But like bwoebi has pointed out, there is a lot to do to improve this regex. Thanks ! – HamZa Jun 18 '13 at 15:00
  • @HamZa, which leads us to the original escape problem that I mentioned: http://regex101.com/r/zX5aX8 – Qtax Jun 18 '13 at 15:16
  • @HamZa your effort deserves at least 250. I know this regex can be improved, but currently it just works in many cases. (( I will give 250 to bwoebi too )) – AgelessEssence Jun 19 '13 at 01:03
21

Even though you asked for a regex, this also works with pure PHP; token_get_all is the key function here. For a regex, check out @HamZa's answer.

The advantage here is that it is more dynamic than a regex. A regex has a static pattern, while with token_get_all you can decide what to do after every single token. It even escapes single quotes and backslashes where necessary, which a regex wouldn't do.

Also, even a commented regex makes it hard to see what it is supposed to do; what the code does is much easier to understand when you read PHP code.
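
To see what token_get_all hands back before reading the full solution, here is a minimal sketch. The input string and the `$texts` helper array are made up for illustration and are not part of the solution itself.

```php
<?php
// Each token is either a single-character string such as "(" or ",",
// or an array of [token id, source text, line number].
$tokens = token_get_all('<?php array("a" => 1);');

$texts = array();
foreach ($tokens as $token) {
    if (is_array($token)) {
        // token_name() turns the numeric id into e.g. T_DOUBLE_ARROW
        echo token_name($token[0]), ': ', var_export($token[1], true), "\n";
        $texts[] = $token[1];
    } else {
        echo 'char: ', $token, "\n";
        $texts[] = $token;
    }
}

// Concatenating the token texts reproduces the original source.
echo implode('', $texts), "\n";
```

Because every byte of the input ends up in exactly one token, a loop can copy tokens through unchanged and only intervene at the ones it cares about, such as T_DOUBLE_ARROW or a top-level comma.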

$code = ' array(

  0  => "a",
  "a" => $GlobalScopeVar,
  "b" => array("nested"=>array(1,2,3)),  
  "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
  "string_literal",
  12345

); ';

$token = token_get_all("<?php ".$code);
$newcode = "";

$i = 0;
while (++$i < count($token)) { // enter into array; then start.
        if (is_array($token[$i]))
                $newcode .= $token[$i][1];
        else
                $newcode .= $token[$i];

        if ($token[$i] == "(") {
                $ending = ")";
                break;
        }
        if ($token[$i] == "[") {
                $ending = "]";
                break;
        }
}

// init variables
$escape = 0;
$wait_for_non_whitespace = 0;
$parenthesis_count = 0;
$entry = "";

// main loop
while (++$i < count($token)) {
        // don't match commas in func($a, $b)
        if ($token[$i] == "(" || $token[$i] == "{") // ( -> normal parenthesis; { -> closures
                $parenthesis_count++;
        if ($token[$i] == ")" || $token[$i] == "}")
                $parenthesis_count--;

        // begin new string after T_DOUBLE_ARROW
        if (!$escape && $wait_for_non_whitespace && (!is_array($token[$i]) || $token[$i][0] != T_WHITESPACE)) {
                $escape = 1;
                $wait_for_non_whitespace = 0;
                $entry .= "'";
        }

        // here is a T_DOUBLE_ARROW, there will be a string after this
        if (is_array($token[$i]) && $token[$i][0] == T_DOUBLE_ARROW && !$escape) {
                $wait_for_non_whitespace = 1;
        }

        // entry ended: comma reached
        if (!$parenthesis_count && $token[$i] == "," || ($parenthesis_count == -1 && $token[$i] == ")" && $ending == ")") || ($ending == "]" && $token[$i] == "]")) {
                // go back to the first non-whitespace
                $whitespaces = "";
                if ($parenthesis_count == -1 || ($ending == "]" && $token[$i] == "]")) {
                        $cut_at = strlen($entry);
                        while ($cut_at && ord($entry[--$cut_at]) <= 0x20); // 0x20 == " "
                        $whitespaces = substr($entry, $cut_at + 1, strlen($entry));
                        $entry = substr($entry, 0, $cut_at + 1);
                }

                // $escape == true means: there was somewhere a T_DOUBLE_ARROW
                if ($escape) {
                        $escape = 0;
                        $newcode .= $entry."'";
                } else {
                        $newcode .= "'".addcslashes($entry, "'\\")."'";
                }

                $newcode .= $whitespaces.($parenthesis_count?")":(($ending == "]" && $token[$i] == "]")?"]":","));

                // reset
                $entry = "";
        } else {
                // add actual token to $entry
                if (is_array($token[$i])) {
                        $addChar = $token[$i][1];
                } else {
                        $addChar = $token[$i];
                }

                if ($entry == "" && $token[$i][0] == T_WHITESPACE) {
                        $newcode .= $addChar;
                } else {
                        $entry .= $escape?str_replace(array("'", "\\"), array("\\'", "\\\\"), $addChar):$addChar;
                }
        }
}

//append remaining chars like whitespaces or ;
$newcode .= $entry;

print $newcode;

Demo at: http://3v4l.org/qe4Q1

Should output:

array(

  0  => '"a"',
  "a" => '$GlobalScopeVar',
  "b" => 'array("nested"=>array(1,2,3))',  
  "c" => 'function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }',
  '"string_literal"',
  '12345'

) 

To get the array's data, you can use print_r(eval("return $newcode;")); which prints the entries of the array:

Array
(
    [0] => "a"
    [a] => $GlobalScopeVar
    [b] => array("nested"=>array(1,2,3))
    [c] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
    [1] => "string_literal"
    [2] => 12345
)
bwoebi
  • hey!!, i like the magic* you do here, i don't care if this is not a RegEx solution cuz it actually works, in fact, love the way you go beyond the required solution (( RegEx )) and give just what i need... i tested it and apparently there are no bugs*, so, congrats!! you awarded +250 for this beauty piece of code. (( yes, you should wait, cuz im waiting 23hours to give +250 points to Hamza first )) – AgelessEssence Jun 17 '13 at 00:32
  • @iim.hlk No magic, just processing an array returned by token_get_all() ;-P I'll wait for the bounty ;-) – bwoebi Jun 17 '13 at 07:37
  • I am trying to start a new bounty of 250 to "pay you what is yours", but it only gives me the choice of 500 points :c – AgelessEssence Jun 19 '13 at 01:05
  • @iim.hlk didn't you know that, if you give another bounty on the same question, you have to double the bounty? – bwoebi Jun 19 '13 at 08:28
  • @iim.hlk now: what'll you do? give a bounty (I really wrote this code to get the bounty :|) or not? – bwoebi Jun 21 '13 at 09:24
  • sorry, i can't give 500, i didn't know i should double last bounty in the same question, mmm maybe i can bounty to another answered question by you?? (( let me know which )) – AgelessEssence Jun 21 '13 at 21:05
  • @iim.hlk If you want to, I'd consider bountying this answer, which was a lot of work too: http://stackoverflow.com/questions/16714107/malicious-php-file-content/16715495#16715495 – bwoebi Jun 21 '13 at 21:23
  • hey, this is a closed question, i can't bounty on this :| – AgelessEssence Jun 21 '13 at 21:27
  • @iim.hlk http://stackoverflow.com/questions/16375331/increment-on-tostring/16376690#16376690 this? – bwoebi Jun 22 '13 at 00:01
4

The clean way to do this is obviously to use the tokenizer (but keep in mind that the tokenizer alone doesn't solve the problem).

For the challenge, I propose a regex approach.

The idea is not to describe the full PHP syntax, but rather to describe it negatively: I describe only the basic PHP structures needed to obtain the result. The advantage of this basic description is that it copes with more complex values than functions, strings, integers or booleans. The result is a more flexible pattern that can deal with, for example, multi-line and single-line comments and heredoc/nowdoc syntax:

<pre><?php

$code=' array(
  0   => "a",
  "a" => $GlobalScopeVar,
  "b" => array("nested"=>array(1,2,3)),  
  "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },
); ';

$pattern = <<<'EOD'
~
# elements
(?(DEFINE)
    # comments
    (?<comMulti> /\* .*? (?:\*/|\z) )                                              # multiline comment
    (?<comInlin> (?://|\#) \N* $ )                                                 # inline comment
    (?<comments> \g<comMulti> | \g<comInlin> )

    # strings
    (?<strDQ> " (?>[^"\\]+|\\.)* ")                                                # double quote string
    (?<strSQ> ' (?>[^'\\]+|\\.)* ')                                                # single quote string
    (?<strHND> <<<(["']?)([a-zA-Z]\w*)\g{-2} (?>\R \N*)*? \R \g{-1} ;? (?=\R|$) )  # heredoc and nowdoc syntax
    (?<string> \g<strDQ> | \g<strSQ> | \g<strHND> )

    # brackets
    (?<braCrl> { (?> \g<nobracket> | \g<brackets> )* } )
    (?<braRnd> \( (?> \g<nobracket> | \g<brackets> )* \) )
    (?<braSqr> \[ (?> \g<nobracket> | \g<brackets> )* ] )
    (?<brackets> \g<braCrl> | \g<braRnd> | \g<braSqr> )

    # nobracket: content between brackets except other brackets
    (?<nobracket> (?> [^][)(}{"'</\#]+ | \g<comments> | / | \g<string> | <+ )+ )

    # ignored elements
    (?<s> \s+ | \g<comments> )
)

# array components
(?(DEFINE)    
    # key
    (?<key> [0-9]+ | \g<string> )

    # value
    (?<value> (?> [^][)(}{"'</\#,\s]+ | \g<s> | / | \g<string> | <+ | \g<brackets> )+? (?=\g<s>*[,)]) )
)
(?J)
(?: \G (?!\A)(?<!\)) | array \g<s>* \( ) \g<s>* \K

    (?: (?<key> \g<key> ) \g<s>* => \g<s>* )? (?<value> \g<value> ) \g<s>* (?:,|,?\g<s>*(?<stop> \) ))
~xsm
EOD;


if (preg_match_all($pattern, $code, $m, PREG_SET_ORDER)) {
    foreach($m as $v) {
        echo "\n<strong>Whole match:</strong> " . $v[0]
           . "\n<strong>Key</strong>:\t" . $v['key']
           . "\n<strong>Value</strong>:\t" . $v['value'] . "\n";
        if (isset($v['stop']))
            echo "\n<strong>done</strong>\n\n"; 

    }
}
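
The (?(DEFINE)...) blocks in the pattern above only declare named subpatterns; nothing is matched until one of them is called with \g<name>. A reduced sketch of that technique, assuming made-up subpattern names (`dq`, `sq`, `str`) and a made-up subject string:

```php
<?php
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<dq>  " (?> [^"\\]+ | \\. )* " )   # double-quoted string
    (?<sq>  ' (?> [^'\\]+ | \\. )* ' )   # single-quoted string
    (?<str> \g<dq> | \g<sq> )            # either kind of string
)
\g<str>                                  # the only part that actually matches
~xs
EOD;

// Subject text: x = "a \" b", y = 'c'
$subject = 'x = "a \" b", y = \'c\'';
preg_match_all($pattern, $subject, $m);

print_r($m[0]); // the two string literals, escaped quote included
```

Keeping the building blocks in DEFINE is what lets the full pattern reuse `string`, `brackets` and `comments` in several places without repeating them.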
Casimir et Hippolyte
  • hey bro, i have a little problem with this solution just need a little fix i think, can check it please : http://ideone.com/FKrPxS, maybe there is a problem with the empty($v['stop']) conditional, thanks for your time. – AgelessEssence Nov 25 '15 at 07:42
3

Here is what you asked for, very compact. Please let me know if you'd like any tweaks.

THE CODE (you can run this straight in php)

$code=' array(
  0  => "a",
 "a" => $GlobalScopeVar,
 "b" => array("nested"=>array(1,2,3)),  
 "c" => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },

); ';

$regex = "~(?xm)
^[\s'\"]*([^'\"\s]+)['\"\s]*
=>\s*+
(.*?)\s*,?\s*$~";

if(preg_match_all($regex,$code,$matches,PREG_SET_ORDER)) {
    $array=array();
    foreach($matches as $match) {
        $array[$match[1]] = $match[2];
    }

    echo "<pre>";
    print_r($array);
    echo "</pre>";

} // END IF

THE OUTPUT

Array
(
    [0] => "a"
    [a] => $GlobalScopeVar
    [b] => array("nested"=>array(1,2,3))
    [c] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
)

$array contains your array.

You like?

Please let me know if you have any questions or require tweaks. :)

zx81
  • hey bro! i didn't see this answer until now, looks really good, in fact i always love compact the code (( do more with few lines )), currently i can't test this, but i really appreciate your effort, thanks :) – AgelessEssence Nov 27 '14 at 07:29
2

Just for this situation:

$code=' array(

  0=>"a",
  "a"=>$GlobalScopeVar,
  "b"=>array("nested"=>array(1,2,3)),  
  "c"=>function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; },

); ';

preg_match_all('#\s*(.*?)\s*=>\s*(.*?)\s*,?\s*$#m', $code, $m);
$array = array_combine($m[1], $m[2]);
print_r($array);

Output:

Array
(
    [0] => "a"
    ["a"] => $GlobalScopeVar
    ["b"] => array("nested"=>array(1,2,3))
    ["c"] => function() use (&$VAR) { return isset($VAR) ? "defined" : "undefined" ; }
)
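
The "just for this situation" caveat can be checked quickly: the pattern relies on the m modifier and `$`, so it expects one element per line. A sketch with two made-up inputs (`$multi` and `$single` are illustrative names, not part of the answer's code):

```php
<?php
$pattern = '#\s*(.*?)\s*=>\s*(.*?)\s*,?\s*$#m';

// One element per line: each line yields its own key/value pair.
$multi = "0=>\"a\",\n\"b\"=>\"c\",";
preg_match_all($pattern, $multi, $m1);

// Same elements on a single line: only one match is found, and its
// "value" swallows the rest of the line.
$single = '0=>"a", "b"=>"c",';
preg_match_all($pattern, $single, $m2);

print_r($m1[1]); // keys found in the multi-line input: 0 and "b"
print_r($m2[2]); // the single value found: "a", "b"=>"c"
```

The second result shows why a line-based pattern cannot split elements that share a line.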
HamZa
  • Why do you say "Just for this situation"? Can't this regex extract any kind of properly structured array? – AgelessEssence Jun 14 '13 at 22:36
  • @iim.hlk It will fail for something like [this](http://codepad.org/2TdgJwJr). Basically each "element" needs to be on a new line in order for this to succeed. – HamZa Jun 14 '13 at 22:40
  • Geez!, sounds dirty but i can offer 400 of my rep to get that magic waterproof invincible regex... think about it, with +6000 of rep you will catch ALL the ladies (( and ocasionally a JB )) :$ – AgelessEssence Jun 14 '13 at 22:55
  • @iim.hlk Sorry but regex doesn't know "waterproof", you'll need a parser. – HamZa Jun 14 '13 at 22:58