I would like to parse an ASCIIMath expression and convert it to MathML. The result is NOT for display on a webpage, though, so I cannot use MathJax or the native parser (both are written in JavaScript). I am using PHP. There is ASCIIMathMLPHP, but it is outdated and does not fully suit my needs either; for instance, it has no support for the absolute value symbol.

I have absolutely no experience with parsing (or RegEx), so I may have started down the wrong path in the first place. My approach is to use a RegEx to extract the tokens of the expression, loop through them and keep breaking them down with the same RegEx until they cannot be broken down any further, and finally append these smallest tokens to a DOMNode, and eventually to a DOMDocument, which I output. I hope that sounds reasonable so far.
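
To make the last step concrete, here is a minimal sketch of how already-parsed tokens could be appended to a DOMDocument. It is not taken from the class below: the input (a fraction with numerator a and denominator b) is hard-coded, and the helper name build_fraction is made up for illustration.

```php
<?php
// Hypothetical helper (not part of the class below):
// wraps two identifier tokens in an <mfrac> element.
function build_fraction(DOMDocument $doc, string $num, string $den): DOMElement {
    $frac = $doc->createElement('mfrac');
    $frac->appendChild($doc->createElement('mi', $num));
    $frac->appendChild($doc->createElement('mi', $den));
    return $frac;
}

$doc = new DOMDocument('1.0', 'UTF-8');

// Root <math> element, created the same way as in create_math_element().
$math = $doc->createElement('math');
$math->setAttribute('xmlns', 'http://www.w3.org/1998/Math/MathML');
$doc->appendChild($math);

// Pretend the tokenizer has already produced "a/b".
$math->appendChild(build_fraction($doc, 'a', 'b'));

// Serialize only the <math> subtree (no XML declaration).
$xml = $doc->saveXML($math);
echo $xml, "\n";
```

Passing the node to saveXML() keeps the XML declaration out of the output, which may or may not be what is wanted when the MathML is embedded elsewhere.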

Questions

  1. conceptual: If I use the RegEx produced by get_regex_for( SIMPLE_EXPRESSION ) or get_regex_for( INTERMEDIATE_EXPRESSION ), I get the error "recursive call could loop indefinitely" at compile time, but not with get_regex_for( EXPRESSION ). Why is that?
  2. practical: If I use the RegEx produced by get_regex_for( EXPRESSION ), it matches almost everything. However, I can't seem to modify the expression so that the capturing groups properly catch all the tokens I want. Is there a way?
  3. practical: So far I have found that the above RegEx does not match grouping brackets with no content, but simply adding a ? or ?+ after the . seems to cause catastrophic backtracking. I am aware of atomic groups but am not sure whether to apply them here. Any suggestions? SOLVED: I shouldn't have added ? to the . because I don't want to match zero-length content anywhere else: it would always succeed in matching a zero-length string, which leads to infinite recursion. Instead, I should change lEr to lE?r. The change is now reflected in the code below.
  4. conceptual: Continuation of Q1: If I prepend (?&S)| to get_regex_for( EXPRESSION ), I don't get any error. Then why does the error in Q1 occur? Furthermore, if I prepend (?&I)(\/(?&I))?| , the tokens seem to match, but isn't that redundant, since it is basically what E represents in the grammar below?
  5. practical: Right now, parsing input with unbalanced brackets is slow. Is there a way to avoid this?
  6. practical: Is there a better approach that would save me the time of figuring all these things out?
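
The fix described in Q3 can be reproduced in isolation. The sketch below is a reduced stand-in for the real pattern, assuming single lowercase letters as the only constants and round brackets as the only grouping brackets. The recursive reference is made optional (lE?r) rather than the constant itself, so every alternative still consumes at least one character.

```php
<?php
// Reduced stand-in for the real grammar (assumption: only "(" ")" and a-z).
// The recursive call (?&E) is optional, mirroring the change lEr -> lE?r;
// nothing in the pattern can match a zero-length string.
$pattern = '/^(?P<E>(?:\((?&E)?\)|[a-z])+)$/';

$samples = array('()', '(a(b)())', '(a');
$results = array();
foreach ($samples as $s) {
    // preg_match returns 1 on a match and 0 on no match.
    $results[$s] = preg_match($pattern, $s);
    echo $s, ' => ', $results[$s], "\n";
}
```

Empty brackets and nested brackets match, while the unbalanced input does not.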

My whole class is included here for reference. It is far from complete (especially the symbol list and callbacks), but let's focus on the approach (comments on other things are also welcome!):

<?php
// element tags
define( 'IDENTIFIER', 'mi' );
define( 'NUMBER', 'mn' );
define( 'OPERATOR', 'mo' );
define( 'SQUARE_ROOT', 'msqrt' );
define( 'TEXT', 'mtext' );
define( 'STYLE', 'mstyle' );
define( 'SPACE', 'mspace' );

// expression tags
define( 'FRACTION', 'mfrac' );
define( 'ROOT', 'mroot' );
define( 'SUBSCRIPT', 'msub' );
define( 'SUPERSCRIPT', 'msup' );
define( 'SUB_SUPERSCRIPT', 'msubsup' );

// format tags
define( 'OVER', 'mover' );
define( 'UNDER', 'munder' );
define( 'UNDER_OVER', 'munderover' );

define( 'FENCED', 'mfenced' );

// parameter specifier
define( 'HEX', 0 );
define( 'AS_IS', 1 );
define( 'HIDDEN', 2 );
define( 'ATTRIBUTE', 3 );
define( 'SYMBOL', 4 );
define( 'OUTER', 5 );

define( 'TO', 0 );

define( 'CONSTANT', 0 );
define( 'UNARY', 1 );
define( 'BINARY', 2 );
define( 'LEFT', 3 );
define( 'RIGHT', 4 );
define( 'SIMPLE_EXPRESSION', 5 );
define( 'INTERMEDIATE_EXPRESSION', 6 );
define( 'EXPRESSION', 7 );

class Simple_Ascii_Math_Parser {
    private $mathml;
    private $math;

    private function __construct() {
        $this->mathml = new DOMDocument;
        $this->mathml->formatOutput = true;
        $this->create_math_element();
    }
    private function create_math_element() {
        $this->math = $this->mathml->createElement( 'math' );
        $this->math->setAttribute( 'xmlns', 'http://www.w3.org/1998/Math/MathML' );
        $this->mathml->appendChild( $this->math );
    }
    public static function get_regex_for( $type, &$defined = array() ) {
        // contains intentional assignment in ternary operators
        !empty( $defined ) or $defined = array_fill( 0, 8, false );
        switch( $type ) {
            case CONSTANT:
                return ( !$defined[ CONSTANT ] and $defined[ CONSTANT ] = true )? 
                    sprintf( '(?P<V>%s|%s|%s)', '(?:[0-9]*+\.)?[0-9]+', self::get_regex_of( self::$__CONSTANT ), 
                        sprintf( '(?!%s|%s|\\|).', self::get_regex_for( LEFT, $defined ), self::get_regex_for( RIGHT, $defined ) )
                    ) : '(?&V)';
            break;
            case UNARY:
                return ( !$defined[ UNARY ] and $defined[ UNARY ] = true )?
                    sprintf( '(?P<U>%s|%s)', self::get_regex_of( self::$__UNARY ), self::get_regex_of( self::$__SPECIAL_UNARY_FUNC ) ) : '(?&U)';
            break;
            case BINARY:
                return ( !$defined[ BINARY ] and $defined[ BINARY ] = true )?
                    sprintf( '(?P<B>%s)', self::get_regex_of( self::$__BINARY ) ) : '(?&B)';
            break;
            case LEFT:
                return ( !$defined[ LEFT ] and $defined[ LEFT ] = true )?
                    sprintf( '(?P<L>%s)', self::get_regex_of( self::$__GROUPING_BRACKETS_LEFT ) ) : '(?&L)';
            break;
            case RIGHT:
                return ( !$defined[ RIGHT ] and $defined[ RIGHT ] = true )?
                    sprintf( '(?P<R>%s)', self::get_regex_of( self::$__GROUPING_BRACKETS_RIGHT ) ) : '(?&R)';
            break;
            case SIMPLE_EXPRESSION:
                return ( !$defined[ SIMPLE_EXPRESSION ] and $defined[ SIMPLE_EXPRESSION ] = true )?
                    sprintf( '(?P<S>%s|%s|%s|%s|%s)', 
                        sprintf( '%s%s', self::get_regex_for( UNARY, $defined ), self::get_regex_for( SIMPLE_EXPRESSION, $defined ) ),
                        sprintf( '%s%s{2}', self::get_regex_for( BINARY, $defined ), self::get_regex_for( SIMPLE_EXPRESSION, $defined ) ),
                        sprintf( '%s%s?%s', self::get_regex_for( LEFT, $defined ), self::get_regex_for( EXPRESSION, $defined ), self::get_regex_for( RIGHT, $defined ) ),
                        sprintf( preg_quote('|%s|', '/'), self::get_regex_for( EXPRESSION, $defined ) . '?' ),
                        sprintf( '%s', self::get_regex_for( CONSTANT, $defined ) )
                    ) : '(?&S)';
            break;
            case INTERMEDIATE_EXPRESSION:
                return ( !$defined[ INTERMEDIATE_EXPRESSION ] and $defined[ INTERMEDIATE_EXPRESSION ] = true )?
                    sprintf( '(?P<I>(%s)(?:_(%s))?(?:\\^(%s))?)', 
                            self::get_regex_for( SIMPLE_EXPRESSION, $defined ), self::get_regex_for( SIMPLE_EXPRESSION, $defined ), self::get_regex_for( SIMPLE_EXPRESSION, $defined )
                    ) : '(?&I)';
            break;
            case EXPRESSION:
                return ( !$defined[ EXPRESSION ] and $defined[ EXPRESSION ] = true )?
                    sprintf( '(?P<E>%s(?:(?&E)|\\/(?&I))?)', 
                        self::get_regex_for( INTERMEDIATE_EXPRESSION, $defined )
                    ) : '(?&E)';
            break;
        }
    }
    public static function get_regex_of( $array ) {
        $result = '';
        $keys = array_keys( $array );
        // Longer key comes first
        usort( $keys, function( $e1, $e2 ){ return strlen( $e2 ) - strlen( $e1 ); });
        foreach($keys as $key ) {
            $result .= '|' . preg_quote( $key, '/' ) ;
        }
        return substr( $result, 1 );
    }

    private static $__CONSTANT = array(
        // lowercase greek symbols
        'alpha' => array( 'syntax' => 'alpha', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B1' ) ),
        'beta' => array( 'syntax' => 'beta', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B2' ) ),
        'gamma' => array( 'syntax' => 'gamma', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B3' ) ),
        'delta' => array( 'syntax' => 'delta', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B4' ) ),
        'epsi' => array( 'syntax' => 'epsi', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B5' ) ),
        'epsilon' => array( 'syntax' => 'epsilon', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B5' ) ),
        'zeta' => array( 'syntax' => 'zeta', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B6' ) ),
        'eta' => array( 'syntax' => 'eta', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B7' ) ),
        'theta' => array( 'syntax' => 'theta', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B8' ) ),
        'iota' => array( 'syntax' => 'iota', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03B9' ) ),
        'kappa' => array( 'syntax' => 'kappa', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03BA' ) ),
        'lambda' => array( 'syntax' => 'lambda', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03BB' ) ),
        'mu' => array( 'syntax' => 'mu', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03BC' ) ),
        'nu' => array( 'syntax' => 'nu', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03BD' ) ),
        'xi' => array( 'syntax' => 'xi', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03BE' ) ),
        // hex = 03BF : omicron is not supported, use letter o instead...
        'pi' => array( 'syntax' => 'pi', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C0' ) ),
        'rho' => array( 'syntax' => 'rho', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C1' ) ),
        // hex = 03C2 : final sigma is not supported...
        'sigma' => array( 'syntax' => 'sigma', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C3' ) ),
        'tau' => array( 'syntax' => 'tau', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C4' ) ),
        'upsilon' => array( 'syntax' => 'upsilon', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C5' ) ),
        'phi' => array( 'syntax' => 'phi', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C6' ) ),
        'chi' => array( 'syntax' => 'chi', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C7' ) ),
        'psi' => array( 'syntax' => 'psi', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C8' ) ),
        'omega' => array( 'syntax' => 'omega', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03C9' ) ),

        // and their variations
        'varepsilon' => array( 'syntax' => 'varepsilon', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '025B' ) ),
        'vartheta' => array( 'syntax' => 'vartheta', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03D1' ) ),
        'varphi' => array( 'syntax' => 'varphi', 'callback' => 'output', 'args' => array( IDENTIFIER, HEX, '03D5' ) ),

        // uppercase greek symbols
        // note: uppercases are treated as operators
        'Gamma' => array( 'syntax' => 'Gamma', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '0393' ) ),
        'Delta' => array( 'syntax' => 'Delta', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '0394' ) ),
        'Theta' => array( 'syntax' => 'Theta', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '0398' ) ),
        'Lambda' => array( 'syntax' => 'Lambda', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '039B' ) ),
        'Xi' => array( 'syntax' => 'Xi', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '039E' ) ),
        'Pi' => array( 'syntax' => 'Pi', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '03A0' ) ),
        'Sigma' => array( 'syntax' => 'Sigma', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '03A3' ) ),
        'Phi' => array( 'syntax' => 'Phi', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '03A6' ) ),
        'Psi' => array( 'syntax' => 'Psi', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '03A8' ) ),
        'Omega' => array( 'syntax' => 'Omega', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '03A9' ) ),

        // constants
        // operators
        '+' => array( 'syntax' => '+', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
        '-' => array( 'syntax' => '-', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
        '+-' => array( 'syntax' => '+-', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '00B1' ) ),
        '*' => array( 'syntax' => '*', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22C5' ) ),
        '**' => array( 'syntax' => '**', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22C6' ) ),
        '//' => array( 'syntax' => '//', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '002F' ) ),
        '\\' => array( 'syntax' => '\\', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '005C' ) ),
        'xx' => array( 'syntax' => 'xx', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '00D7' ) ),
        '-:' => array( 'syntax' => '-:', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '00F7' ) ),
        '@' => array( 'syntax' => '@', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2218' ) ),
        'o+' => array( 'syntax' => 'o+', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2295' ) ),
        'ox' => array( 'syntax' => 'ox', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2297' ) ),
        'o.' => array( 'syntax' => 'o.', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2299' ) ),
        // relation symbols
        '=' => array( 'syntax' => '=', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
        '<' => array( 'syntax' => '<', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
        '>' => array( 'syntax' => '>', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
        '<=' => array( 'syntax' => '<=', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2264' ) ),
        '>=' => array( 'syntax' => '>=', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2265' ) ),
        '!=' => array( 'syntax' => '!=', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2260' ) ),
        '-<' => array( 'syntax' => '-<', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '227A' ) ),
        '>-' => array( 'syntax' => '>-', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '227B' ) ),
        '-=' => array( 'syntax' => '-=', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2261' ) ),
        '~=' => array( 'syntax' => '~=', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2245' ) ),
        '~~' => array( 'syntax' => '~~', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2248' ) ),
        'prop' => array( 'syntax' => 'prop', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '221D' ) ),

        // misc. symbols
        'O/' => array( 'syntax' => 'O/', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2205' ) ),
        'oo' => array( 'syntax' => 'oo', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '221E' ) ),
        'aleph' => array( 'syntax' => 'aleph', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2135' ) ),
        '/_' => array( 'syntax' => '/_', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2220' ) ),
        ':.' => array( 'syntax' => ':.', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2234' ) ),
        'diamond' => array( 'syntax' => 'diamond', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22C4' ) ),
        'square' => array( 'syntax' => 'square', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '25A1' ) ),
        '\\ ' => array( 'syntax' => '\\ ', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '00A0' ) ),

        // dots
        'cdots' => array( 'syntax' => 'cdots', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22EF' ) ),
        'vdots' => array( 'syntax' => 'vdots', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22EE' ) ),
        'ddots' => array( 'syntax' => 'ddots', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22F1' ) ),

        // sets
        'uu' => array( 'syntax' => 'uu', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '222A' ) ),
        'nn' => array( 'syntax' => 'nn', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2229' ) ),
        'vv' => array( 'syntax' => 'vv', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2228' ) ),
        '^^' => array( 'syntax' => '^^', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2227' ) ),
        'in' => array( 'syntax' => 'in', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2208' ) ),
        '!in' => array( 'syntax' => '!in', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2209' ) ),
        'sub' => array( 'syntax' => 'sub', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2282' ) ),
        'sup' => array( 'syntax' => 'sup', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2283' ) ),
        'sube' => array( 'syntax' => 'sube', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2286' ) ),
        'supe' => array( 'syntax' => 'supe', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2287' ) ),

        // brackets
        '||' => array( 'syntax' => '||', 'callback' => 'output', 'args' => array( FENCED, HEX, '2016' ) ),
        '|__' => array( 'syntax' => '|__', 'callback' => 'output', 'args' => array( FENCED, HEX, '230A' ) ),
        '__|' => array( 'syntax' => '__|', 'callback' => 'output', 'args' => array( FENCED, HEX, '230B' ) ),
        '|~' => array( 'syntax' => '|~', 'callback' => 'output', 'args' => array( FENCED, HEX, '2308' ) ),
        '~|' => array( 'syntax' => '~|', 'callback' => 'output', 'args' => array( FENCED, HEX, '2309' ) ),

        // logical symbols
        'not' => array( 'syntax' => 'not', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '00AC' ) ),
        '=>' => array( 'syntax' => '=>', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '21D2' ) ),

        '<=>' => array( 'syntax' => '<=>', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '21D4' ) ),

        'AA' => array( 'syntax' => 'AA', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2200' ) ),
        'EE' => array( 'syntax' => 'EE', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2203' ) ),
        '_|_' => array( 'syntax' => '_|_', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '27C2' ) ),
        'TT' => array( 'syntax' => 'TT', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22A4' ) ),
        '|--' => array( 'syntax' => '|--', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22A2' ) ),
        '|==' => array( 'syntax' => '|==', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '22A8' ) ),

        // arrows
        'uarr' => array( 'syntax' => 'uarr', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2191' ) ),
        '->' => array( 'syntax' => '->', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2192' ) ),
        'rarr' => array( 'syntax' => 'rarr', 'callback' => 'transform', 'args' => array( TO, '->' ) ),
        'darr' => array( 'syntax' => 'darr', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2193' ) ),
        'larr' => array( 'syntax' => 'larr', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2190' ) ),
        'harr' => array( 'syntax' => 'harr', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2194' ) ),
        '|->' => array( 'syntax' => '|->', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '21A6' ) ),
        'lArr' => array( 'syntax' => 'lArr', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '21D0' ) ),
        'hArr' => array( 'syntax' => 'hArr', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '21D4' ) ),

        // unary symbols
        'sqrt'  => array( 'syntax' => 'sqrt', 'callback' => 'special_func', 'args' => array( SQUARE_ROOT ) ),
        'text'  => array( 'syntax' => 'text', 'callback' => 'special_func', 'args' => array( TEXT ) ),
        'f' => array( 'syntax' => 'f', 'callback' => 'special_func', 'args' => array( IDENTIFIER ) ),
        'g' => array( 'syntax' => 'g', 'callback' => 'special_func', 'args' => array( IDENTIFIER ) ),
        // array( 'syntax' => '"(0)"', 'callback' => 'func', 'args' => array( TEXT ) ), // hard-code this...

        // calculus
        'int'   => array( 'syntax' => 'int', 'callback' => 'func_eater', 'args' => array( SUB_SUPERSCRIPT, HEX, '222B' ) ),
        'oint'  => array( 'syntax' => 'oint', 'callback' => 'func_eater', 'args' => array( SUB_SUPERSCRIPT, HEX, '222E' ) ),

        'del'   => array( 'syntax' => 'del', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2202' ) ),
        'grad'  => array( 'syntax' => 'grad', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2207' ) ),
        'prime' => array( 'syntax' => 'prime', 'callback' => 'output', 'args' => array( OPERATOR, HEX, '2032' ) ),

        'dim'   => array( 'syntax' => 'dim', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
        'mod'   => array( 'syntax' => 'mod', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
        'lub'   => array( 'syntax' => 'lub', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
        'glb'   => array( 'syntax' => 'glb', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
    );
    private static $__UNDER_OVER = array(
        // operators
        'sum'   => array( 'syntax' => 'sum', 'callback' => 'func_eater', 'args' => array( OPERATOR, UNDER_OVER, HEX, '2211' ) ),
        'prod'  => array( 'syntax' => 'prod', 'callback' => 'func_eater', 'args' => array( OPERATOR, UNDER_OVER, HEX, '220F' ) ),
        'vvv'   => array( 'syntax' => 'vvv', 'callback' => 'func_eater', 'args' => array( OPERATOR, UNDER_OVER, HEX, '22C1' ) ),
        '^^^'   => array( 'syntax' => '^^^', 'callback' => 'func_eater', 'args' => array( OPERATOR, UNDER_OVER, HEX, '22C0' ) ),
        'uuu'   => array( 'syntax' => 'uuu', 'callback' => 'func_eater', 'args' => array( OPERATOR, UNDER_OVER, HEX, '22C3' ) ),
        'nnn'   => array( 'syntax' => 'nnn', 'callback' => 'func_eater', 'args' => array( OPERATOR, UNDER_OVER, HEX, '22C2' ) ),
        'min'   => array( 'syntax' => 'min', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'max'   => array( 'syntax' => 'max', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'Lim'   => array( 'syntax' => 'Lim', 'callback' => 'func_eater', 'args' => array( OPERATOR, UNDER_OVER, AS_IS ) ),
        'lim'   => array( 'syntax' => 'lim', 'callback' => 'func_eater', 'args' => array( OPERATOR, UNDER_OVER, AS_IS ) ),
    );

    private static $__SPACE = array(
        'and' => array( 'syntax' => 'and', 'callback' => 'output', 'args' => array( TEXT, AS_IS ) ),
        'or' => array( 'syntax' => 'or', 'callback' => 'output', 'args' => array( TEXT, AS_IS ) ),
        'if' => array( 'syntax' => 'if', 'callback' => 'output', 'args' => array( OPERATOR, AS_IS ) ),
    );

    private static $__UNARY = array(
        // standard function
        'sin'   => array( 'syntax' => 'sin', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'cos'   => array( 'syntax' => 'cos', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'tan'   => array( 'syntax' => 'tan', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'csc'   => array( 'syntax' => 'csc', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'sec'   => array( 'syntax' => 'sec', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'cot'   => array( 'syntax' => 'cot', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'sinh'  => array( 'syntax' => 'sinh', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'cosh'  => array( 'syntax' => 'cosh', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'tanh'  => array( 'syntax' => 'tanh', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'log'   => array( 'syntax' => 'log', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'ln'    => array( 'syntax' => 'ln', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'det'   => array( 'syntax' => 'det', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'gcd'   => array( 'syntax' => 'gcd', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'lcm'   => array( 'syntax' => 'lcm', 'callback' => 'func_eater', 'args' => array( OPERATOR, SUB_SUPERSCRIPT, AS_IS ) ),
        'f' => array( 'syntax' => 'f', 'callback' => 'func_eater', 'args' => array( IDENTIFIER, AS_IS ) ),
        'g' => array( 'syntax' => 'g', 'callback' => 'func_eater', 'args' => array( IDENTIFIER, AS_IS ) ),
    );
    private static $__BINARY = array(
        // binary symbols
        'frac'  => array( 'syntax' => 'frac(0)(1)', 'callback' => 'non_math_func', 'args' => array( FRACTION ) ),
        'root'  => array( 'syntax' => 'root(1)(0)', 'callback' => 'non_math_func', 'args' => array( ROOT ) ),
        'stackrel'  => array( 'syntax' => 'stackrel(1)(0)', 'callback' => 'non_math_func', 'args' => array( OVER ) ),
    );
    private static $__SPECIAL_UNARY_FUNC = array(
        // font commands
        'bb' => array( 'syntax' => 'bb', 'callback' => 'non_math_func', 'args' => array( STYLE, ATTRIBUTE, array( 'mathvariant' => 'bold' ) ) ),
        'bbb' => array( 'syntax' => 'bbb', 'callback' => 'non_math_func', 'args' => array( STYLE, ATTRIBUTE, array( 'mathvariant' => 'double-struck' ) ) ),
        'cc' => array( 'syntax' => 'cc', 'callback' => 'non_math_func', 'args' => array( STYLE, ATTRIBUTE, array( 'mathvariant' => 'script' ) ) ),
        'tt'    => array( 'syntax' => 'tt', 'callback' => 'non_math_func', 'args' => array( STYLE, ATTRIBUTE, array( 'mathvariant' => 'monospace' ) ) ),
        'fr' => array( 'syntax' => 'fr', 'callback' => 'non_math_func', 'args' => array( STYLE, ATTRIBUTE, array( 'mathvariant' => 'fraktur' ) ) ),
        'sf' => array( 'syntax' => 'sf', 'callback' => 'non_math_func', 'args' => array( STYLE, ATTRIBUTE, array( 'mathvariant' => 'sans-serif' ) ) ),

        // accents
        'hat'   => array( 'syntax' => 'hat', 'callback' => 'non_math_func', 'args' => array( OVER, HEX, '0302' ) ),
        'bar'   => array( 'syntax' => 'bar', 'callback' => 'non_math_func', 'args' => array( OVER, HEX, '0305' ) ),
        'ul'    => array( 'syntax' => 'ul', 'callback' => 'non_math_func', 'args' => array( UNDER, HEX, '0332' ) ),
        'vec'   => array( 'syntax' => 'vec', 'callback' => 'non_math_func', 'args' => array( OVER, HEX, '20D7' )  ),
        'dot'   => array( 'syntax' => 'dot', 'callback' => 'non_math_func', 'args' => array( OVER, HEX, '0307' )  ),
        'ddot'  => array( 'syntax' => 'ddot', 'callback' => 'non_math_func', 'args' => array( OVER, HEX, '0308' ) ),
        'tilde' => array( 'syntax' => 'tilde', 'callback' => 'non_math_func', 'args' => array( OVER, HEX, '0303' ) ),
    );
    private static $__EXPRESSION = array(
        '/' => array( 'syntax' => '(0)/(1)', 'callback' => 'expression', 'args' => array( FRACTION ) ),
        '_' => array( 'syntax' => '(0)_(1)', 'callback' => 'expression', 'args' => array( SUBSCRIPT, HIDDEN ) ),
        '^' => array( 'syntax' => '(0)^(1)', 'callback' => 'expression', 'args' => array( SUPERSCRIPT, HIDDEN ) ),
        //'(0)_(1)^(2)' => array( 'syntax' => '(0)_(1)^(2)', 'callback' => 'expression', 'args' => array( SUB_SUPERSCRIPT ) ),
    );
    private static $__GROUPING_BRACKETS_LEFT = array(
        '(' => array( 'syntax' => '(', 'callback' => 'output', 'args' => array( FENCED, AS_IS ) ),
        '[' => array( 'syntax' => '[', 'callback' => 'output', 'args' => array( FENCED, AS_IS ) ),
        '{' => array( 'syntax' => '{', 'callback' => 'output', 'args' => array( FENCED, AS_IS ) ),
        '{:' => array( 'syntax' => '{:', 'callback' => 'output', 'args' => array( FENCED, HIDDEN ) ),
        '(:' => array( 'syntax' => '(:', 'callback' => 'output', 'args' => array( FENCED, HEX, '2329' ) ),
        //'|' => array( 'syntax' => '|', 'callback' => 'output', 'args' => array( FENCED, AS_IS ) ),
    );
    private static $__GROUPING_BRACKETS_RIGHT = array(
        ')' => array( 'syntax' => ')', 'callback' => 'output', 'args' => array( FENCED, AS_IS ) ),
        ']' => array( 'syntax' => ']', 'callback' => 'output', 'args' => array( FENCED, AS_IS ) ),
        '}' => array( 'syntax' => '}', 'callback' => 'output', 'args' => array( FENCED, AS_IS ) ),
        //'|' => array( 'syntax' => '|', 'callback' => 'output', 'args' => array( FENCED, AS_IS ) ),
        ':}' => array( 'syntax' => ':}', 'callback' => 'output', 'args' => array( FENCED, HIDDEN ) ),
        ':)' => array( 'syntax' => ':)', 'callback' => 'output', 'args' => array( FENCED, HEX, '232A' ) ),
    );
}
?>
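
As a side note on get_regex_of, the "longer key comes first" ordering matters because PCRE tries alternatives from left to right and takes the first one that matches. The standalone sketch below shows the difference on a handful of symbols; the function name alternation_longest_first is hypothetical and only mirrors what get_regex_of does with the array keys.

```php
<?php
// Hypothetical standalone version of the sorting done in get_regex_of().
function alternation_longest_first(array $symbols): string {
    // Longer symbols first, so "<=>" is tried before "<=" and "<".
    usort($symbols, function ($a, $b) { return strlen($b) - strlen($a); });
    $quoted = array_map(function ($s) { return preg_quote($s, '/'); }, $symbols);
    return implode('|', $quoted);
}

$symbols = array('<', '<=', '<=>', '=');

// Same symbols, once sorted by length and once in the given order.
$sorted   = '/' . alternation_longest_first($symbols) . '/';
$unsorted = '/' . implode('|', array_map(function ($s) { return preg_quote($s, '/'); }, $symbols)) . '/';

preg_match($sorted, 'a<=>b', $m_sorted);
preg_match($unsorted, 'a<=>b', $m_unsorted);

echo 'sorted:   ', $m_sorted[0], "\n";
echo 'unsorted: ', $m_unsorted[0], "\n";
```

With the sorted alternation the whole symbol is matched as one token, whereas the unsorted one stops at the first character.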

The grammar for the ASCIIMath syntax is given by the authors in this file, and it is what I used to write the get_regex_for method:

/**
 * Parsing ASCII math expressions with the following grammar
 * v ::= [A-Za-z] | greek letters | numbers | other constant symbols
 * u ::= sqrt | text | bb | other unary symbols for font commands
 * b ::= frac | root | stackrel         binary symbols
 * l ::= ( | [ | { | (: | {:            left brackets
 * r ::= ) | ] | } | :) | :}            right brackets
 * S ::= v | lEr | uS | bSS             Simple expression
 * I ::= S_S | S^S | S_S^S | S          Intermediate expression
 * E ::= IE | I/I                       Expression
 *
 * Each terminal symbol is translated into a corresponding mathml node.
 */
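
For comparison, the rules S, I and E can also be written down almost one to one with PCRE's (?(DEFINE)...) group, which keeps each rule in one place. This is only a sketch under strong assumptions: v is reduced to numbers and single letters, the only brackets are ( and ), the unary and binary symbols u and b are left out, and E is written the same way as in get_regex_for( EXPRESSION ). It only checks whether the input matches; it does not capture tokens.

```php
<?php
// Sketch only: a heavily reduced subset of the grammar above.
// Assumptions: v = number or single letter, l = "(", r = ")", no u or b.
$grammar = '~
    (?(DEFINE)
        (?<v> (?:[0-9]*+\.)?[0-9]+ | [A-Za-z] )
        (?<S> (?&v) | \( (?&E)? \) )
        (?<I> (?&S) (?: _ (?&S) )? (?: \^ (?&S) )? )
        (?<E> (?&I) (?: (?&E) | / (?&I) )? )
    )
    ^ (?&E) $
~x';

$samples = array('x_1^2', '(a/b)c', '()', 'a/(b', 'x_');
$results = array();
foreach ($samples as $s) {
    // 1 if the whole input is an E, 0 otherwise.
    $results[$s] = preg_match($grammar, $s);
    echo $s, ' => ', $results[$s], "\n";
}
```

Every rule consumes at least one character before it can recurse, so none of the references is left-recursive.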
karbuncle