4

From an external source I'm getting strings like

array(1,2,3)

but also larger arrays like

array("a", "b", "c", array("1", "2", array("A", "B")), array("3", "4"), "d")

I need them to be actual arrays in PHP. I know I could use eval, but since these are untrusted sources I'd rather not do that. I also have no control over the external sources.

Should I use some regular expressions for this (if so, what) or is there some other way?

kennytm
Nin
  • Your external source is giving you a string like this "array(1,2,3)" and you want to turn that text into a php array? – Jonathan Mayhak Jul 16 '10 at 18:52
  • This is going to be tough... That's not a serialization format PHP recognizes. – Artefacto Jul 16 '10 at 18:53
  • Can you control the external source? Is it possible to ask them to generate JSON or XML instead? – kennytm Jul 16 '10 at 18:54
  • @jonathan: Yes, I want that to be put in a PHP array (just like you would get with eval()), but for security reasons I don't want to use eval. @KennyTM: I don't have any control over the external source, so I have to work with this. – Nin Jul 16 '10 at 18:58

3 Answers

11

Whilst writing a parser using the Tokenizer, which turned out to be harder than I expected, I came up with another idea: why not parse the array using eval, but first validate that it contains nothing harmful?

Here is what the code does: it checks the tokens of the array against a whitelist of allowed tokens and characters, and only then executes eval. I hope I included all possible harmless tokens; if not, simply add them. (I intentionally didn't include HEREDOC and NOWDOC, because I think they are unlikely to be used.)

function parseArray($code) {
    $allowedTokens = array(
        T_ARRAY                    => true,
        T_CONSTANT_ENCAPSED_STRING => true,
        T_LNUMBER                  => true,
        T_DNUMBER                  => true,
        T_DOUBLE_ARROW             => true,
        T_WHITESPACE               => true,
    );
    $allowedChars = array(
        '('                        => true,
        ')'                        => true,
        ','                        => true,
    );

    $tokens = token_get_all('<?php '.$code);
    array_shift($tokens); // remove opening php tag

    foreach ($tokens as $token) {
        // char token
        if (is_string($token)) {
            if (!isset($allowedChars[$token])) {
                throw new Exception('Disallowed token \''.$token.'\' encountered.');
            }
            continue;
        }

        // array token

        // true, false and null are okay, too
        if ($token[0] == T_STRING && ($token[1] == 'true' || $token[1] == 'false' || $token[1] == 'null')) {
            continue;
        }

        if (!isset($allowedTokens[$token[0]])) {
            throw new Exception('Disallowed token \''.token_name($token[0]).'\' encountered.');
        }
    }

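    // fetch error messages
    ob_start();
    try {
        $ok = eval('$returnArray = '.$code.';');
    } catch (ParseError $e) {
        // PHP 7+ throws ParseError instead of returning false
        ob_end_clean();
        throw new Exception('Array couldn\'t be eval()\'d: '.$e->getMessage());
    }
    if (false === $ok) {
        // PHP 5 returns false and prints the parse error
        throw new Exception('Array couldn\'t be eval()\'d: '.ob_get_clean());
    }
    ob_end_clean();
    return $returnArray;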
}

var_dump(parseArray('array("a", "b", "c", array("1", "2", array("A", "B")), array("3", "4"), "d")'));

I think this is a good compromise between security and convenience: there is no need to write the parser yourself.

For example

parseArray('exec("haha -i -thought -i -was -smart")');

would throw exception:

Disallowed token 'T_STRING' encountered.
NikiC
  • I was having the same thought :) I haven't given up on the idea of doing it entirely with the tokenizer though, but I'll explore your script first, thanks – Nin Jul 16 '10 at 20:06
6

You could do:

json_decode(str_replace(array('array(', ')'), array('[', ']'), $string));

Replace each "array(" and ")" with square brackets, then json_decode the result. If the string is just a (possibly nested) list of scalar values, the str_replace turns it into valid JSON and json_decode returns the array, provided no string value itself contains "array(" or ")". If the string contains any code, the function call's closing parenthesis is replaced too, the result is no longer valid JSON, and NULL is returned.
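A minimal sketch of how this tends to behave, assuming the input uses only numbers and double-quoted strings. The wrapper name decodeArrayString is hypothetical and not part of the one-liner above:

```php
<?php
// Hypothetical helper wrapping the str_replace + json_decode idea.
function decodeArrayString($string) {
    $json = str_replace(array('array(', ')'), array('[', ']'), $string);
    // json_decode returns null when $json is not valid JSON
    return json_decode($json, true);
}

// Plain and nested lists come back as PHP arrays.
var_dump(decodeArrayString('array(1,2,3)'));
var_dump(decodeArrayString('array("a", array("1", "2"), "d")'));

// A function call becomes invalid JSON, so null is returned.
var_dump(decodeArrayString('exec("haha")'));

// Single-quoted strings and "=>" keys are not valid JSON either.
var_dump(decodeArrayString("array('a', 'b')"));
var_dump(decodeArrayString('array("k" => 1)'));

// Caveat: a ")" inside a string value is replaced as well, so the
// data is silently changed to "a]" instead of being rejected.
var_dump(decodeArrayString('array("a)", "b")'));
```

The last call is the case to watch for: the result is still valid JSON, so nothing signals that the value was altered.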

Granted, that's a rather, umm, creative approach, but might work for you.

Edit: Also, see the comments for some further limitations pointed out by other users.

Gordon
  • I can't test this right now, but that is an elegant solution if it returns properly. +1 – TCCV Jul 16 '10 at 19:17
  • @KennyTM yeah, that wouldn't work. I'll leave it up there nonetheless, so the OP can decide if it's of any use – Gordon Jul 16 '10 at 19:18
  • And this is for arrays only. Won't work on associative arrays. – NikiC Jul 16 '10 at 19:20
  • It's creative, but I like it and it might just do the trick. It will be pretty simple arrays anyway. – Nin Jul 16 '10 at 19:20
2

I think you should use the Tokenizer for this. Maybe I will write a script later on that actually does it.
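For illustration, a rough sketch of what a Tokenizer-only parser (no eval) might look like. The function names parseArrayTokens and parseValueTokens are hypothetical. The sketch assumes plain lists only: it rejects "=>" keys, negative numbers and true/false/null, reads integers as decimal, and does not decode escape sequences inside strings.

```php
<?php
// Sketch of a recursive-descent parser over token_get_all() output.
// Hypothetical names; handles array(...), integers, floats and quoted strings.
function parseArrayTokens($code) {
    $tokens = array();
    // skip the leading open tag, then drop whitespace tokens
    foreach (array_slice(token_get_all('<?php '.$code), 1) as $token) {
        if (!is_array($token) || $token[0] !== T_WHITESPACE) {
            $tokens[] = $token;
        }
    }
    $pos = 0;
    $value = parseValueTokens($tokens, $pos);
    if ($pos !== count($tokens)) {
        throw new Exception('Unexpected trailing input.');
    }
    return $value;
}

function parseValueTokens(array $tokens, &$pos) {
    if (!isset($tokens[$pos])) {
        throw new Exception('Unexpected end of input.');
    }
    $token = $tokens[$pos++];
    if (!is_array($token)) {
        // single-character tokens such as "-" or ";" are plain strings
        throw new Exception('Unexpected token \''.$token.'\'.');
    }
    switch ($token[0]) {
        case T_LNUMBER:
            return (int) $token[1];   // decimal integers only in this sketch
        case T_DNUMBER:
            return (float) $token[1];
        case T_CONSTANT_ENCAPSED_STRING:
            // strip the surrounding quotes; escapes are NOT decoded
            return (string) substr($token[1], 1, -1);
        case T_ARRAY:
            break;                    // handled below
        default:
            throw new Exception('Disallowed token '.token_name($token[0]).'.');
    }
    // after "array": "(", comma-separated values, ")"
    if (!isset($tokens[$pos]) || $tokens[$pos] !== '(') {
        throw new Exception('Expected "(" after array.');
    }
    $pos++;
    $result = array();
    while (isset($tokens[$pos]) && $tokens[$pos] !== ')') {
        $result[] = parseValueTokens($tokens, $pos);
        if (isset($tokens[$pos]) && $tokens[$pos] === ',') {
            $pos++;
        } elseif (!isset($tokens[$pos]) || $tokens[$pos] !== ')') {
            throw new Exception('Expected "," or ")".');
        }
    }
    if (!isset($tokens[$pos])) {
        throw new Exception('Missing ")".');
    }
    $pos++; // consume ")"
    return $result;
}
```

Because nothing is ever executed, anything outside the small accepted grammar, for example parseArrayTokens('exec("haha")'), ends in an exception rather than a function call.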

NikiC