7

Before passing a string to eval() I would like to make sure the syntax is correct and allow:

  1. Two functions: a() and b()
  2. Four operators: /*-+
  3. Brackets: ()
  4. Numbers: 1.2, -1, 1

How can I do this? Maybe it has something to do with the PHP Tokenizer?

I'm actually trying to make a simple formula interpreter so a() and b() will be replaced by ln() and exp(). I don't want to write a tokenizer and parser from scratch.

hidarikani
  • Do you care about the order of those possible inputs? – hoppa Aug 08 '11 at 09:07
  • An example of the function you would allow and a function that should not pass should be added to your question. Note the use of eval should not be taken lightly. – Lawrence Cherone Aug 08 '11 at 09:11
  • That's why he wants to sanitize beforehand, I suspect ;) – hoppa Aug 08 '11 at 09:11
  • I'm trying to make a simple formula interpreter so a() and b() will be replaced by ln() and exp(). – hidarikani Aug 08 '11 at 09:13
  • Do you need to check the syntax before eval'ing, or are you just concerned that no other code is executed? – hakre Aug 08 '11 at 09:17
  • The PHP script will exit if I eval invalid code and I don't want that. I would like to display errors to the user: 'No matching bracket, try again'. – hidarikani Aug 08 '11 at 09:30

5 Answers

3

As far as validation is concerned, the following character tokens are valid:

operator: [/*+-]
funcs:    (a\(|b\()
brackets: [()]
numbers:  \d+(\.\d+)?
space:    [ ]

A simple validation could then check whether the input string is made up of any combination of these patterns. Because the funcs token is quite precise and does not clash much with the other tokens, this validation should be fairly stable without the need to implement any syntax or grammar yet:

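// One regex fragment per token type of the mini language.
$tokens = array(
    'operator' => '[/*+-]',
    'funcs' => '(a\(|b\()',
    'brackets' => '[()]',
    'numbers' => '\d+(\.\d+)?',
    'space' => '[ ]',
);

// Join all fragments into one alternation of non-capturing groups.
$pattern = '';
foreach($tokens as $token)
{
    $pattern .= sprintf('|(?:%s)', $token);
}
// Anchor the alternation so the whole input must consist of tokens only.
$pattern = sprintf('~^(%s)*$~', ltrim($pattern, '|'));

echo $pattern;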

The input validates only if the whole string matches the token-based pattern. It might still be syntactically wrong PHP, but you can ensure that it is built only from the specified tokens:

~^((?:[/*+-])|(?:(a\(|b\())|(?:[()])|(?:\d+(\.\d+)?)|(?:[ ]))*$~

If you build the pattern dynamically - as in the example - you're able to modify your language tokens later on more easily.
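As a rough sketch of how such a pattern could be applied, the snippet below wraps the same token list in a helper and runs preg_match() against a few inputs. The function name isTokenValid() is an assumption made for illustration, not part of the original answer. Note the second call: the string consists only of allowed tokens, so it passes, even though it is not a complete expression.

```php
<?php
// Hypothetical helper (the name is illustrative): builds the same pattern as
// above and reports whether the input consists solely of the allowed tokens.
function isTokenValid(string $input): bool
{
    $tokens = array(
        'operator' => '[/*+-]',
        'funcs'    => '(a\(|b\()',
        'brackets' => '[()]',
        'numbers'  => '\d+(\.\d+)?',
        'space'    => '[ ]',
    );

    // Wrap each fragment in a non-capturing group and join with "|".
    $parts = array();
    foreach ($tokens as $token) {
        $parts[] = sprintf('(?:%s)', $token);
    }
    $pattern = sprintf('~^(%s)*$~', implode('|', $parts));

    return preg_match($pattern, $input) === 1;
}

var_dump(isTokenValid('a(1.2) + b(-1) * 3')); // bool(true)
var_dump(isTokenValid('a(1 +'));              // bool(true): tokens are fine, syntax is not
var_dump(isTokenValid('system(1)'));          // bool(false)
```

This only guards the vocabulary; unbalanced brackets or doubled operators still get through, which is why a syntax check is a separate step.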

Additionally, this can be the first step towards your own tokenizer / lexer. The token stream can then be passed on to a parser, which can syntactically validate and interpret it. That's the part user187291 wrote about.

As an alternative to writing a full lexer and parser, if you need to validate the syntax, you can formulate your grammar in terms of tokens as well and then match a regex-based token grammar against the token representation of the input.

The tokens are the words you use in your grammar. You will need to describe parentheses and function definitions more precisely than in the tokens above, and the tokenizer should follow clearer rules about which token supersedes another. The concept is outlined in another question of mine. It uses regex as well for grammar formulation and syntax validation, but it still does not parse. In your case, eval would be the parser you're making use of.

hakre
  • [Simulate php array language construct or parse with regexp?](http://stackoverflow.com/questions/3267951/simulate-php-array-language-construct-or-parse-with-regexp/3268443#3268443) – hakre Jan 05 '12 at 23:24
2

Parser generators have indeed already been written for PHP, and "LIME" in particular comes with the typical "calculator" example, which would be an obvious starting point for your "mini language": http://sourceforge.net/projects/lime-php/

It's been years since I last played with LIME, but it was already mature & stable then.

Notes:

1) Using a full-on parser generator gives you the advantage of avoiding PHP eval() entirely if you wish - you can make LIME emit a parser which effectively provides an "eval" function for expressions written in your mini language (with validation baked in). This gives you the additional advantage of allowing you to add support for new functions, as needed.

2) It may seem like overkill at first to use a parser generator for such an apparently small task, but once you get the examples working you'll be impressed by how easy it is to modify and extend them. And it's very easy to underestimate the difficulty of writing a bug-free parser (even a "trivial" one) from scratch.

Peter
  • @stereofrog - Sorry, I misread the change history. The link should at least work, now, although I know it doesn't match the preferred format (for some reason bracketed links aren't being rendered properly for me in Chrome). – Peter Aug 08 '11 at 10:35
0

Yes, you need the Tokenizer, or something similar, but it's only part of the story. A tokenizer (more commonly called a "lexer") can only read an expression and split it into its elements; it has no means to detect that something like "foo()+*bar)" is invalid. You need the second part, called a parser, which arranges the tokens into a kind of tree (called an "AST") or provides an error message when it fails to do so. Ironically, once you've got a tree, "eval" is not needed anymore: you can evaluate your expression directly from the tree.

I would recommend you to write a parser by hand because it's a very useful exercise and a lot of fun. Recursive descent parsers are quite easy to program.
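To give an idea of the size of such a parser, here is a minimal sketch of a recursive descent evaluator for the mini language from the question. The class name FormulaParser, its method names, and the error messages are all assumptions made for illustration. For brevity it computes the value while parsing instead of building an AST first, and it maps a() to log() and b() to exp() as the question describes.

```php
<?php
// Illustrative sketch only. Grammar:
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := '-' factor | number | ('a' | 'b') '(' expr ')' | '(' expr ')'
final class FormulaParser
{
    private string $src = '';
    private int $pos = 0;

    public function evaluate(string $input): float
    {
        $this->src = $input;
        $this->pos = 0;
        $value = $this->expr();
        if ($this->peek() !== '') { // leftover input means a syntax error
            throw new InvalidArgumentException("Unexpected '{$this->peek()}' at position {$this->pos}");
        }
        return $value;
    }

    // Skips spaces and returns the next character ('' at end of input).
    private function peek(): string
    {
        while (($this->src[$this->pos] ?? '') === ' ') {
            $this->pos++;
        }
        return $this->src[$this->pos] ?? '';
    }

    private function expect(string $char): void
    {
        if ($this->peek() !== $char) {
            throw new InvalidArgumentException("Expected '$char' at position {$this->pos}");
        }
        $this->pos++;
    }

    private function expr(): float
    {
        $value = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $op = $this->src[$this->pos++];
            $rhs = $this->term();
            $value = $op === '+' ? $value + $rhs : $value - $rhs;
        }
        return $value;
    }

    private function term(): float
    {
        $value = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $op = $this->src[$this->pos++];
            $rhs = $this->factor();
            if ($op === '/' && $rhs == 0) {
                throw new InvalidArgumentException('Division by zero');
            }
            $value = $op === '*' ? $value * $rhs : $value / $rhs;
        }
        return $value;
    }

    private function factor(): float
    {
        $char = $this->peek();
        if ($char === '-') { // unary minus
            $this->pos++;
            return -$this->factor();
        }
        if ($char === 'a' || $char === 'b') { // function call: a = ln, b = exp
            $this->pos++;
            $this->expect('(');
            $arg = $this->expr();
            $this->expect(')');
            return $char === 'a' ? log($arg) : exp($arg);
        }
        if ($char === '(') {
            $this->pos++;
            $value = $this->expr();
            $this->expect(')');
            return $value;
        }
        // \G anchors the match at the current offset.
        if (preg_match('~\G\d+(?:\.\d+)?~', $this->src, $m, 0, $this->pos) === 1) {
            $this->pos += strlen($m[0]);
            return (float) $m[0];
        }
        throw new InvalidArgumentException("Unexpected input at position {$this->pos}");
    }
}

$parser = new FormulaParser();
echo $parser->evaluate('b(0) + 2 * (3 - 1)'), "\n"; // prints 5
try {
    $parser->evaluate('a(1 + 2');
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n"; // Expected ')' at position 7
}
```

Because syntax errors surface as exceptions, the script can show a message such as "no matching bracket" to the user instead of exiting, and eval() is never called.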

user187291
  • I agree that writing a parser by hand is fun, and is part of basic training for any serious programmer. But if you're working "on the clock" it would be better IMO to re-use a pre-existing parser generator and spend some of the time saved learning about (and playing with) grammar definitions. – Peter Aug 08 '11 at 09:44
  • I see your point, but I disagree in the case of parsers in particular. First, even if you're interested in learning how stuff works you are likely to make a few standard mistakes along the road to building a parser which is robust enough to determine whether a given string from the Big Bad Web may be safely passed to PHP eval() - and although we dislike ignorance we also dislike vulnerable web applications. Second, I don't think it's possible to use a parser generator (even superficially) without learning something new and useful, even if you ultimately decide on another approach. – Peter Aug 08 '11 at 10:28
0

You could use token_get_all(), inspect each token, and abort at the first invalid token.
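A minimal sketch of that idea follows. The function name firstInvalidToken() and the exact whitelist are assumptions chosen to match the question, not something prescribed by the tokenizer. Since token_get_all() only tokenizes text that follows an opening tag, the formula is prefixed with one.

```php
<?php
// Illustrative sketch: returns the text of the first token that is not on the
// whitelist, or null if every token is allowed.
function firstInvalidToken(string $formula): ?string
{
    $allowedIds   = [T_OPEN_TAG, T_WHITESPACE, T_LNUMBER, T_DNUMBER];
    $allowedChars = ['(', ')', '+', '-', '*', '/'];
    $allowedFuncs = ['a', 'b'];

    foreach (token_get_all('<?php ' . $formula) as $token) {
        if (is_string($token)) { // single-character tokens come back as strings
            if (!in_array($token, $allowedChars, true)) {
                return $token;
            }
            continue;
        }
        [$id, $text] = $token;   // other tokens are [id, text, line] arrays
        if ($id === T_STRING && in_array($text, $allowedFuncs, true)) {
            continue;            // a or b used as a function name
        }
        if (!in_array($id, $allowedIds, true)) {
            return $text;
        }
    }
    return null;
}

var_dump(firstInvalidToken('a(1.2) + b(-1)'));   // NULL
var_dump(firstInvalidToken('a(1) + system(2)')); // string(6) "system"
var_dump(firstInvalidToken('$x + 1'));           // string(2) "$x"
```

Two caveats: PHP's number tokens are broader than the question's format (hexadecimal or exponent notation would also pass), and like any token-level check this says nothing about bracket balance or operator placement.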

rid
0

hakre's answer, using regexes, is a nice solution but a wee bit complicated. Handling a whitelist of functions also becomes rather messy, and if the validation goes wrong it could have a very nasty effect on your system.

Is there a reason you don't use JavaScript's 'eval' instead?

symcbean