Parsing data into structures using bison

Question

I learned about flex and bison in school over the summer, and now I want to dive a little deeper. I'm having trouble understanding the documentation for Bison 3.0.2. Maybe some of you could help me out. I want to parse a string representing an equation, and at the same time, fill out a data structure which contains information about what was parsed. For instance, say I have (ax+b)^2. I want the parser to generate a structure containing a string and an integer constant that looks like the following.

(  BEGINGROUP
a  VARIABLE
x  VARIABLE
+  ADDITION
b  VARIABLE
)  ENDGROUP

I have created a language specification using flex, and I have created a grammar using bison. All that's needed is to get the two to put information into the structures. I have some code which works in kind of the way I want it to, but I can't help but think I'm missing something. In the Bison documentation examples, I see them using $$ or $1 which they say is to check semantic values? When I print the semantic values, I always get zero. Anyway, my code is posted below.

math.l

%{
 #include "equation.h"

Equation* equ;
void setEquation(Equation* equation) {
    equ = equation;
}
%}

/* Definitions */
space     [ \t]
digit     [0-9]
letter    [a-zA-Z]
number    ({digit}+|{digit}+"."{digit}+|"."{digit}+|{digit}+".")
variable  {letter}

/* actions */
%%
{space}      ;
{number}     equ->addElement(yytext, Equation::number); return(1);
{variable}   equ->addElement(yytext, Equation::variable); return(2);  
"+"          equ->addElement(yytext, Equation::addition); return(10); /* Basic operators */
"-"          return(11);
"*"          return(12);
"/"          return(13);
"^"          return(14);
"log"        return(15); 
"sin"        return(20); /* Trigonometric Functions */
"cos"        return(21);
"tan"        return(22);
"csc"        return(23);              
"sec"        return(24);
"cot"        return(25);
"arcsin"     return(26);
"arccos"     return(27);
"arctan"     return(28); 
"("          equ->addElement(yytext, Equation::begGroup); return(30); /* Grouping Operators */
")"          equ->addElement(yytext, Equation::endGroup); return(31);
"["          return(32);
"]"          return(33);
","          return(34);
.            fprintf(stderr, "Error on character %s\n", yytext);

math.y

/*
* Implement grammer for equations
*/
%{
#include "lex.yy.c"
#include "equation.h"
#include <iostream>

int yylex(void);
int yyerror(const char *msg);

void output(const char* where) {
    std::cout << where << ": " << yytext << std::endl;
}
%}

%token e_num       1
%token e_var       2
%token e_plus     10
%token e_minus    11
%token e_mult     12
%token e_div      13
%token e_pow      14
%token e_log      15 
%token e_sin      20
%token e_cos      21
%token e_tan      22
%token e_csc      23              
%token e_sec      24
%token e_cot      25
%token e_asin     26
%token e_acos     27
%token e_atan     28 
%token lparen     30
%token rparen     31
%token slparen    32
%token srparen    33
%token comma      34
%start Expression
%%

Expression : Term  MoreTerms 
           | e_minus Term MoreTerms
           ;
MoreTerms  : /* add a term */
             e_plus Term MoreTerms
           | /* subtract a term */
             e_minus Term MoreTerms
           | /* add a negetive term */
             e_plus e_minus Term MoreTerms /* Add a negetive term */
           | /* minus a negetive term */
             e_minus e_minus Term MoreTerms /* Subtract a negetive term */
           | /* no extra terms */
           ;
Term       : Factor MoreFactors {equ->addElement("*", Equation::multiplication)};
           ;
MoreFactors: e_mult Factor MoreFactors
           | e_div  Factor MoreFactors
           | Factor MoreFactors
           |
           ;
Factor     : e_num { std::cout << $1 << std::endl; } //returns zero no matter where I put this
           | e_var
           | Group 
           | Function 
           ;
BeginGroup : lparen | slparen; 
EndGroup   : rparen | srparen;
Group      : BeginGroup Expression EndGroup 
              ;

Function   : TrigFuncs
           | PowerFunc
           ;
TrigFuncs  : e_sin lparen Expression rparen
           | e_cos lparen Expression rparen
           | e_tan lparen Expression rparen
           | e_csc lparen Expression rparen
           | e_sec lparen Expression rparen
           | e_cot lparen Expression rparen
           | e_asin lparen Expression rparen
           | e_acos lparen Expression rparen
           | e_atan lparen Expression rparen
           ;
PowerFunc  : e_num e_pow Factor
           | e_var e_pow Factor
           | Group e_pow Factor
           | TrigFuncs e_pow Factor
           ;

I think it's pretty clear what I'm doing. As you can see, the scanner stores yytext into the equation class along with its code, but sometimes the parser must add information to the equation class, and this is where things can get hectic. For one, trying to add code before or in the middle of the statement can lead to massive shift/reduce conflicts. Secondly, the effect of putting the code at the end of statement is to record things out of order. Look at the rule for term. If I type "ax", this implicitly means, "a" times "x" or "a*x". I want the parser to add the multiplication into my structure in, but the parser does this out of order. So is there a better way to accomplish my goal?

Brian Tompsett - 汤莱恩 · Answer 1 · 2020-06-10T21:07:01.660

You're probably wondering why no one has answered your question in four months; when it seems like such a simple problem with a calculator. This is because it is not a simple question to answer. Its a portmanteau of a question with lots of hidden nooks and crannies for the unwary. Now there are quite a few bison experts contributing to answers on Stack Overflow who really know their stuff, and all of them have avoided this like a plague. If you'd simplified the problem and asked about just one thing at a time you might have got some answers, but you just pasted in the whole raft of code and dropped in some "Oh, by the way, fix everything else while you're at it!". However, if someone wanted to debug your code they couldn't because you missed out a whole load of critical stuff, like the Equation object. StackOverflow is not a bunch of free coders you know! Have a read of some of the guidance.

Let's start with something simple; this $$ and $1 stuff which you started with. You called them "semantic values". Not really. They are a mechanism of passing a value back through the parse tree, and also for the lexical analyser to associate a token value to give to the parser. The reason you always get a zero is you never assigned any value to it in the lexer in the first place. The value is assigned by reference to the built in variable yylval. That's in the documentation. Lets take your number lexeme which returns an e_num token. I could write this (in math.l):

{number}     {yylval = atoi(yytext);return(e_num);}

This permits us to now print the value of the number:

Factor     : e_num { std::cout << $1 << std::endl; }

It looks much better if you use the name of the enumerated constant rather than those absolute values. Pretty poor coding to hard-wire those numbers in like that.

I also see that you've created your stack and push things on it in the lexer. Probably not a good plan, but lets go with that for now. You can associate the token with such an object value if you wanted. The %union construct (in bison) is used for that:

%union {
Equation* TokenValue;
}

Now you have to starting typing the terminals and non-terminals of your grammar:

%token <TokenValue> e_enum;

But now we can get the value using these structures if you wish:

{number}    {equ->addElement(yytext, Equation::number); 
            yylval.TokenValue = equ;
            return(e_num);
            }

and in bison:

Factor     : e_num { std::cout << $1->ElementText << std::endl; }

In the bison grammar there are places where you might want to pass back several values from the right hand side of the rule to the left hand side. This is where the $$ = $2 form of construct would be used, but I'll leave you to read about that.

The next thing to mention is the way you have described the grammar of arithmetic expressions. There are two problems, mainly due to the handling of negatives and the grammar form.

There is a bit of computer science we need here. You have (mainly) represented the expressions as lists of operators and operands, which is regarded as a Regular Language or Chomsky Type 3 form. The problem is that arithmetic expressions are, in the main, nested in structure. Nesting requires a Context Free Grammar or Chomsky Type 2. This is why all examples of calculator grammars use the form:

exp : exp OP exp

rather than the listing form you have used.

Now you have used rule hierarchies to get some of the priorities of the operators in the grammar, but unfortunately the unary minus (or negation) operator would have a higher priority than the binary subtraction operator and thus should appear in the Factor rule. These are the reasons why you get so many shift/reduce conflicts. It's just not the right way to do it. There is a reason all the text books do calculators and expressions in a way different to your example. You need to go back to the text books.

This is why people study this stuff at college. I hope that helps some readers who are looking at similar problems in the future.

Parsing data into structures using bison

1 Answers1