Parsing complex mathematical expressions

Question

I'm currently writing (for fun) a programming language heavily inspired by mathematics.

The operators (and their precedence/associativity) are defined within the language, for example:

let factorial be an operator[int <- int!]
    with precedence of 2
    and left-to-right associativity;
let factorial(n) = ...;

let abs be an operator[number <- |number|]
    with precedence of 2
    and no associativity;
let abs(x) = ...;

let not be an operator[bool <- !bool]
   with precedence of 2
   and right-to-left associativity;

This holds true for more common operators (+, -, *, /, ...).

NB: the syntax is not definitive

In order to generate a useful AST, I decided to have multiple parsing steps. In the first step, an expression is just a list of terms and symbols (with parenthesis grouping supported), for example:

m! / (n! * (m - n)!)

Will be parsed as (simplified AST here):

["m", "!", "/", ["n", "!", "*", ["m", "-", "n"], "!"]]

After the first step, I know what operators are defined, I know their order of priority, and their associativity.

I'm having difficulties implementing the second step that is supposed to generate a tree (instead of a list) for the expression.

The first step is implemented using a library taking an EBNF syntax as input. For the second step, I'm trying to "find" in the expression the patterns of the operators (using a homemade Parser Combinator):

let factorial be an operator[int <- int!] ...;
# here the pattern is ["int", "!"]

But this method does not respect the precedence nor the associativity.

If anyone has a suggestion on how to proceed from here, or a link to a paper on the subject, it would be very welcome since I'm running out of hairs to pull :)

rici · Accepted Answer · 2020-08-02T01:11:34.643

You can use an operator precedence parser, such as the infamous Shunting Yard Algorithm. (This answer might also be helpful. Or maybe not.)

The shunting-yard algorithm is often presented (including in that Wikipedia article) as a way of "translating infix to reverse polish". It's not. It's a parsing algorithm, and it can easily produce an AST, or three-address code, or whatever else you want to generate with it. But I agree that an AST is the best option.

To modify the algorithm on Wikipedia to an algorithm which produces an syntax tree, do the following:

Replace what they call "the output queue" with a stack of tree nodes.
When they "push a value onto the output queue", create an AST node of the value (a literal or an identifier, most likely), and push it onto the stack.}
When they "push an operator onto the output queue": the nodes on the top of the stack are the child nodes of the operator syntax node. So pop them (remembering that the one on the top of the stack is the rightmost argument, for operators which take more than one operand) and combine them into a new operator node. Then push this node back onto the stack.
At the end of the parse, the stack will contain exactly one element (if all operators were used with the correct number of arguments), which is the AST node for the entire expression.

Many languages which allow operator syntax to be defined in the source code use exactly this algorithm.

Having done a couple of those myself, let me offer a couple of pieces of advice which you are free to ignore:

Although these days Unicode offers the possibility to use all sorts of single "characters" as mathematical operators, the reality is that ⊛ is not all that easy to type, and a library which insisted you used it for some function might not be all that popular with users. Of course, you could try writing your own cross-platform Unicode-aware IDE, but that strikes me as a very deep and convoluted rabbit hole, a maze of twisty passages all alike. So it's quite likely that you'll want to allow that to be typed as something like (*). But then you have the problem that you don't know what the tokens are until you read the operator declarations. There are no truly satisfying solutions (until someone writes the aforementioned IDE), and most of the languages I've seen which allow custom operators insist that they be separated with whitespace, after having designated some set of characters which can be used as operator characters.
As anyone who has written bitwise expressions in C or C++ knows, having a lot of different operator precedences is not all that user friendly. Most of us don't have intuitions which guide us in understanding how x + 2 << n ^ 1 | 7 is intended to be parsed, and most style guides require you (reasonably, IMHO) to add redundant parentheses even if you know precisely what that expressions means, because people reading your code probably don't. ((((x + 2) << n) ^ 1) | 7)). When the operators have just been added ad hoc to the syntax, intuitions are probably even less helpful. But while only a few might guess that xor comes between and and or, or that all bitwise operators except ~ bind more tightly than booleans, it probably is common to understand that multiplication-type operators have higher precedence than additive-type operators in the same domain.

Anyway, the point I intended to make is that declarations like "⋇ has precedence 5", while perfectly reasonable in the library which defines ⋇, doesn't compose well, because "5" is a completely arbitrary number. Perhaps there is a different library intended for a different domain, which declares that "⋈ has precedence 3". The authors of the two libraries probably never came together to decide what the relative precedence of ⋇ and ⋈ are, and the fact that one used 5 and the other 3 does not indicate that one of those operators "is intended" to bind more tightly than the other one. Personally, I prefer a style where operator precedences are defined relative to other operators in the same module ("∷ has the same precedence as ⋕ and higher precedence than ⊹".) From those declarations, you could build a partial ordering, which is sufficient for operator-precedence parsing; if the parser finds two unparenthesized operators which do not have a declared precedence order, it should complain rather than just using a comparison between arbitrary numbers out of any context. But that's just my opinion.)

thank you for your answer! for your first point I was not intending to use unicode, I like to stick to what you can easily type on your keyboard (I don't want to make APL 2.0 and provide a keyboard). As for your second point, this is the Perl 6 way ( https://perlgeek.de/blog-en/perl-5-to-6/13-custom-operators.html ) and it is way more readable in fact ! the use of parenthesis to help readability should be the developer's concern IMHO, the language still needs predictable rules (even complex ones). — linkdd, Aug 01 '20 at 12:12
@linkdd: Yes, relative precedence declarations like `multi sub infix: is tighter(&infix:<+>)` are essentially what I was advocating. It is certainly better than using arbitrary numbers as precedence levels. However, at least afaics from the Raku docs, operator precedence is an almost total ordering (except for non-associative precedence levels). My way is a lot more strict (but in practice, it works out fine). I didn't find anything in the docs that explains what happens when your definitions are ambiguous, which certainly seems possible. (Eg. a postfix operator is also infix). — rici, Aug 02 '20 at 01:30
It is possible to detect the ambiguity, so it's at least conceivable that an error message is produced. I didn't try to install Raku to find out. Anyway, good luck with your project. — rici, Aug 02 '20 at 01:32

Parsing complex mathematical expressions

1 Answers1