Why is a binary operator selected over a unary operator?

Question

For operators that act as both unary and binary, why is the binary one selected in an expression like a@b?

After a lot of thought and searching, I still haven't been able to answer why something like a+b is parsed as a binary expression instead of a(+b), which would obviously be gibberish.

I don't think a context free grammar would be able to distinguish the two, and trying to find an answer in this version of the standard has not given me any answers.

Does the parser choose the binary version specifically because the unary version would be gibberish? If so, is there a section in the standard that outlines this?

@MooingDuck you're right, at least for this, it's just an [ambiguous grammar](https://en.wikipedia.org/wiki/Ambiguous_grammar) but can still be context free. Which makes sense because the grammar outlines the rules for creating well formed formulas not analyzing the semantics. — Colin Hicks, Feb 03 '21 at 00:52
@ColinHicks: It's not ambiguous though. Either the expression is `A → E + E` or it's `A → + E`. They have unique leftmost derivation or parse trees, and so are fully unambiguous. — Mooing Duck, Feb 03 '21 at 01:11
@MooingDuck its not an ambiguous grammar? What about most vexing parse](https://en.m.wikipedia.org/wiki/Most_vexing_parse) which can be generated by two completely different production rules. Or a+++b which again is a valid string in the language but is chosen to be a++ +b while it could have easily have been a+ ++b. — Colin Hicks, Feb 03 '21 at 01:34
@ColinHicks `a+b` vs `+b` is always unambiguous. You're right that C++ as a whole is very ambiguous, and I did not mean to imply otherwise. — Mooing Duck, Feb 03 '21 at 16:39

John Kugelman · Accepted Answer · 2021-02-03T01:15:33.843

Context-free doesn't mean "no state". A parser has a lot of state to keep track of what grammatical rules are possible given the tokens it's seen so far and to predict what tokens will come next. Because there's no rule that says that two expressions can appear directly adjacent to each other it'll never even consider that a+b could be the expressions a and +b side by side.

For example, let's say that we're using this rudimentary grammar:

expr → expr '+' unary_expr | unary_expr
unary_expr → '+' unary_expr | IDENT

^{(Notation: → gives the rules a non-terminal can expand to and | indicates alternate possibilities. '+' is the plus token and IDENT is any identifier token.)}

Let's parse a+b. Our parser's starting state will be:

1. expr → expr '+' unary_expr
         ^
2. expr → unary_expr
         ^
3. unary_expr → '+' unary_expr
               ^
4. unary_expr → IDENT
               ^

There are the rules it's considering at the start. It doesn't know which it's going to get, could be any of them. Notice that each production it's considering also includes a cursor, which I've marked with ^ carets above. That's where in the rule the parser is.

Okay, so now it sees the first IDENT token. It updates its state to the following:

1. expr → expr '+' unary_expr
              ^
2. expr → unary_expr
                    ^
3. unary_expr → IDENT
                     ^

Now there are three rules that it's considering. Notice that the cursor has moved along to the right in each of them.

If the first rule is correct then it has just seen an expression and is expecting a '+' next. Alternatively, maybe the second rule is the correct one and a is just a unary expression. In that case it expects no more tokens to follow. The parser doesn't know which it's going to be so it's considering both.

You'll see that if the next token is a '+' then it must be a binary plus. Why? Because the first rule is the only rule that anticipates a '+' token next.

For '+' to be interpreted as a unary plus the parser would have to have this rule in its active state, with the cursor before the '+':

unary_expr → '+' unary_expr 
            ^

And you can see that it doesn't.

If context-free doesn't mean stateless, what does it mean, then? What "context" are we "free" from?

Context-free is a restriction on what rules the grammar can contain. The opposite is context-sensitive, where where productions can vary based on their surroundings. Context-sensitive grammars are more powerful than context-free grammars but they are much more difficult to parse—even for humans! Language theoreticians figured out early on that context-free grammars occupy a sweet spot of being powerful enough to be expressive without being overwhelmingly complex to reason about.

For more details see: Context-free grammars versus context-sensitive grammars?

A context-free grammar (CFG) is a grammar where (as you noted) each production has the form A → w, where A is a nonterminal and w is a string of terminals and nonterminals. Informally, a CFG is a grammar where any nonterminal can be expanded out to any of its productions at any point. The language of a grammar is the set of strings of terminals that can be derived from the start symbol.

A context-sensitive grammar (CSG) is a grammar where each production has the form wAx → wyx, where w and x are strings of terminals and nonterminals and y is also a string of terminals. In other words, the productions give rules saying "if you see A in a given context, you may replace A by the string y." It's an unfortunate that these grammars are called "context-sensitive grammars" because it means that "context-free" and "context-sensitive" are not opposites, and it means that there are certain classes of grammars that arguably take a lot of contextual information into account but aren't formally considered to be context-sensitive.

Why is a binary operator selected over a unary operator?

1 Answers1