4

My input is quite simple:

$input = '( ( "M" AND ( "(" OR "AND" ) ) OR "T" )';

where ( starts a new node on tree and ) ends it. AND and OR words are reserved for boolean operation so until they are not inside "" marks, they have a special meaning. In my DSL AND and OR clauses alter By node level so that there can be only either AND or OR clauses at level. If AND comes after OR, it will start a new subnode. All characters inside "" should be regarded as they are. Finally " could be escaped with \" as usual.

What is a good way to make translate sentence which look like this in PHP:

$output = array(array(array("M" , array("(", "AND")) , "T"), FALSE);

Note that FALSE is an indicator, that the root level had OR keyword. If input was:

( ( "M" AND ( "(" OR "AND" ) ) AND "T" )

then output would be:

$output = array(array(array("M", array("(", "AND")), "T"), TRUE);

It is tempting to use replace('(', 'array('); and eval code, but then escaping characters and wrapping literals would become an issue.

At the moment I'm not implementing NOT boolean operator on DSL.

Thanks for any help. JavaSript code are ok too.

Python example:

I made some tests with Python before going to PHP and Javascript. What I did was:

  1. find string literals with regex
  2. replace literals with generated keys
  3. store literals to assoc list
  4. split input to single level list by parenthesis
  5. find root level boolean operator
  6. get rid of boolean operators and white space
  7. replace literal keys with stored values

It might work but I'm sure there must be much more sophisticated way to do it.

http://codepad.org/PdgQLviI

MarkokraM
  • 980
  • 1
  • 12
  • 26
  • What indicates `OR` keyword for nested level? `( "(" AND "AND" )` instead of `( "(" OR "AND" )` in Your example. – shudder Jan 19 '17 at 20:27
  • It is indicated by upper level, which is indicated by root level. So I actually need AND/OR clause for root level only. Other levels swap clause back and forth. I hope I understood you correctly... – MarkokraM Jan 19 '17 at 20:32
  • It means AND/OR clauses are optional on any other than root level. Just to make everything more readable. – MarkokraM Jan 19 '17 at 20:34
  • 1
    What you want to build is a parser. See my SO answers on how to build recursive descent parsers, which are ideal for parsing simple expressions of this kind. http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 – Ira Baxter Jan 19 '17 at 20:47

1 Answers1

1

Here's my side-project's library modification. It should handle these kind of strings - perform some stress tests and let me know if it breaks somewhere.

Tokenizer type class is needed to extract and tokenize variables, so they don't interfere with syntax parsing and tokenize parethesis so they could be matched directly (lazy-evaluated content wouldn't catch nested level and greedy would cover all contexts on the same level). It also has some keyword syntax (a little more than needed, since it will be parsed only for root level). Throws InvalidArgumentException when trying to access variables registry with wrong key, and RuntimeException when parenthesis don't match.

class TokenizedInput
{
    const VAR_REGEXP         = '\"(?P<string>.*?)\"';
    const BLOCK_OPEN_REGEXP  = '\(';
    const BLOCK_CLOSE_REGEXP = '\)';
    const KEYWORD_REGEXP     = '(?<keyword>OR|AND)';

    // Token: <TOKEN_DELIM_LEFT><TYPE_TOKEN><ID_DELIM>$id<TOKEN_DELIM_RIGHT>
    const TOKEN_DELIM_LEFT  = '<';
    const TOKEN_DELIM_RIGHT = '>';

    const VAR_TOKEN         = 'VAR';
    const KEYWORD_TOKEN     = 'KEYWORD';
    const BLOCK_OPEN_TOKEN  = 'BLOCK';
    const BLOCK_CLOSE_TOKEN = 'ENDBLOCK';

    const ID_DELIM  = ':';
    const ID_REGEXP = '[0-9]+';

    private $original;
    private $tokenized;
    private $data = [];

    private $blockLevel = 0;
    private $varTokenId = 0;

    protected $procedure = [
        'varTokens'    => self::VAR_REGEXP,
        'keywordToken' => self::KEYWORD_REGEXP,
        'blockTokens'  => '(?P<open>' . self::BLOCK_OPEN_REGEXP . ')|(?P<close>' . self::BLOCK_CLOSE_REGEXP . ')'
    ];

    private $tokenMatch;

    public function __construct($input) {
        $this->original = (string) $input;
    }

    public function string() {
        isset($this->tokenized) or $this->tokenize();
        return $this->tokenized;
    }

    public function variable($key) {
        isset($this->tokenized) or $this->tokenize();
        if (!isset($this->data[$key])) {
            throw new InvalidArgumentException("Variable id:($key) does not exist.");
        }
        return $this->data[$key];
    }

    public function tokenSearchRegexp() {
        if (!isset($this->tokenMatch)) {
            $strings  = $this->stringSearchRegexp();
            $blocks   = $this->blockSearchRegexp();
            $this->tokenMatch = '#(?:' . $strings . '|' . $blocks . ')#';
        }
        return $this->tokenMatch;
    }

    public function stringSearchRegexp($id = null) {
        $id = $id ?: self::ID_REGEXP;
        return preg_quote(self::TOKEN_DELIM_LEFT . self::VAR_TOKEN . self::ID_DELIM)
            . '(?P<id>' . $id . ')'
            . preg_quote(self::TOKEN_DELIM_RIGHT);
    }

    public function blockSearchRegexp($level = null) {
        $level = $level ?: self::ID_REGEXP;
        $block_open = preg_quote(self::TOKEN_DELIM_LEFT . self::BLOCK_OPEN_TOKEN . self::ID_DELIM)
            . '(?P<level>' . $level . ')'
            . preg_quote(self::TOKEN_DELIM_RIGHT);
        $block_close = preg_quote(self::TOKEN_DELIM_LEFT . self::BLOCK_CLOSE_TOKEN . self::ID_DELIM)
            . '\k<level>'
            . preg_quote(self::TOKEN_DELIM_RIGHT);
        return $block_open . '(?P<contents>.*)' . $block_close;
    }

    public function keywordSearchRegexp($keyword = null) {
        $keyword = $keyword ? '(?P<keyword>' . $keyword . ')' : self::KEYWORD_REGEXP;
        return preg_quote(self::TOKEN_DELIM_LEFT . self::KEYWORD_TOKEN . self::ID_DELIM)
            . $keyword
            . preg_quote(self::TOKEN_DELIM_RIGHT);
    }

    private function tokenize() {
        $current = $this->original;
        foreach ($this->procedure as $method => $pattern) {
            $current = preg_replace_callback('#(?:' . $pattern . ')#', [$this, $method], $current);
        }

        if ($this->blockLevel) {
            throw new RuntimeException("Syntax error. Parenthesis mismatch." . $this->blockLevel);
        }

        $this->tokenized = $current;
    }

    protected function blockTokens($match) {
        if (isset($match['close'])) {
            $token = self::BLOCK_CLOSE_TOKEN . self::ID_DELIM . --$this->blockLevel;
        } else {
            $token = self::BLOCK_OPEN_TOKEN . self::ID_DELIM . $this->blockLevel++;

        }

        return $this->addDelimiters($token);
    }

    protected function varTokens($match) {
        $this->data[$this->varTokenId] = $match[1];
        return $this->addDelimiters(self::VAR_TOKEN . self::ID_DELIM . $this->varTokenId++);
    }

    protected function keywordToken($match) {
        return $this->addDelimiters(self::KEYWORD_TOKEN . self::ID_DELIM . $match[1]);
    }

    private function addDelimiters($token) {
        return self::TOKEN_DELIM_LEFT . $token . self::TOKEN_DELIM_RIGHT;
    }
}

Parser type class performs matching on tokenized string - pulls out registered variables and goes recursively into nested contexts by clonig itself. Operator type handling is unusual, which makes it more of a derived class, but it's hard to achieve satysfying abstraction in Parsers' world anyway.

class ParsedInput
{
    private $input;
    private $result;
    private $context;

    public function __construct(TokenizedInput $input) {
        $this->input = $input;
    }

    public function result() {
        if (isset($this->result)) { return $this->result; }

        $this->parse($this->input->string());
        $this->addOperator();

        return $this->result;
    }

    private function parse($string, $context = 'root') {
        $this->context = $context;
        preg_replace_callback(
            $this->input->tokenSearchRegexp(),
            [$this, 'buildStructure'],
            $string
        );

        return $this->result;
    }

    protected function buildStructure($match) {
        if (isset($match['contents'])) { $this->parseBlock($match['contents'], $match['level']); }
        elseif (isset($match['id'])) { $this->parseVar($match['id']); }
    }

    protected function parseVar($id) {
        $this->result[] = $this->input->variable((int) $id);
    }

    protected function parseBlock($contents, $level) {
        $nested = clone $this;
        $this->result[] = $nested->parse($contents, (int) $level);
    }

    protected function addOperator() {
        $subBlocks = '#' . $this->input->blockSearchRegexp(1) . '#';
        $rootLevel = preg_replace($subBlocks, '', $this->input->string());
        $rootKeyword = '#' . $this->input->keywordSearchRegexp('AND') . '#';
        return $this->result[] = (preg_match($rootKeyword, $rootLevel) === 1);
    }

    public function __clone() {
        $this->result = [];
    }
}

Example usage:

$input = '( ( "M" AND ( "(" OR "AND" ) ) AND "T" )';

$tokenized = new TokenizedInput($input);
$parsed = new ParsedInput($tokenized);

$result = $parsed->result();

I removed namespaces/imports/intrefaces, so you might adjust'em as you need. Also didn't want to dig through (possibly invalid now) comments, so removed them as well.

shudder
  • 2,076
  • 2
  • 20
  • 21
  • Clone method on parseBlock and maybe result method on ParsedInput does not seem to work. My environment: PHP 5.6.25 (cli) (built: Aug 18 2016 11:39:15) Copyright (c) 1997-2016 The PHP Group Zend Engine v2.6.0, Copyright (c) 1998-2016 Zend Technologies On first case file doesnt compile if i dont change to clone $this->... instead of (clone $this)-> – MarkokraM Jan 20 '17 at 16:23
  • if blockSearchRegexp in addOperator is called without level argument, then result is better, you may know the reason? – MarkokraM Jan 20 '17 at 16:37
  • 1
    @MarkokraM Some php7 flavours in the code (1.operator precedence - `clone` error, 2.primitive typehints - in `blockSearchRegexp()` and others). Should be ok now. – shudder Jan 20 '17 at 16:45