How to get rule text in yacc

Question

declaration: type instance SEMICOLON
{
    //I want to get string here
}

type:...
instance:...

I want to get the currently matched declaration string in the declaration action.Is there any way to do this in yacc?

The actual matched text of a token is the so-called lexeme. If you need the lexeme in yacc (the parser) you have to consider this in lex (the scanner). The scanner, has to provide the lexeme together with the token. E.g. instead of returning an integral value for the token, you could return a token `struct` with the token value and the corresponding lexeme (at least for the tokens where this is of interest). The most critical part is life-time management i.e. to prevent leaking of memory used to store the lexemes - at least as long as you generate a C parser with yacc. — Scheff's Cat, Jul 12 '22 at 10:03
I was looking for a possible duplicate and found this instead: [lex and yacc (symbol table generation)](https://stackoverflow.com/q/31262197/7478597). This brought me to another idea: For all the relevant tokens, you store the lexemes in a global table (`std::vector`) in the lexer. Thus, the `struct token` passed to the parser needs two integrals only: one for token value and one for the index into the global table. This would simplify the life-time management for the stored lexemes, at least. — Scheff's Cat, Jul 12 '22 at 10:14
A possible dupe but without an answer: [How to find the string of yacc rule?](https://stackoverflow.com/q/64720160/7478597) ...though the comments might be worth to have a look. — Scheff's Cat, Jul 12 '22 at 10:20
FYI: [What is the difference between a token and a lexeme?](https://stackoverflow.com/a/14958865/7478597) — Scheff's Cat, Jul 12 '22 at 10:21
Thank you Cope.So I need to provide lexeme through lexer. But how to change lexer? Are there any related tutorials ? — clay, Jul 12 '22 at 11:28
Doc about lex & yacc (and company) can be found on [The Lex & Yacc Page](http://dinosaur.compilertools.net/). — Scheff's Cat, Jul 12 '22 at 11:45

Stephan Schlecht · Answer 1 · 2022-07-15T09:08:55.900

If one is interested in the input string corresponding to a grammatical rule, and the original whitespaces do not matter, one possibility, as mentioned in the comments, would be for the scanner to supply not only values but additionally also the lexemes.

But this is still not sufficient. The parser would have to concatenate the individual lexemes of the affected token in the semantic actions.

To avoid memory leaks, processed strings must then also be freed accordingly.

Scanner adaptions

Let assume you have a scanner rule for a NUMBER token:

[0-9]+   { yylval = atoi(yytext);  return NUMBER; }

We would now also have to provide the lexeme in addition. This means, however, that a union for the different value types is not enough, but a struct that combines the lexeme and the value union is needed.

A simple struct with only one int value could look as follows:

struct my_token {
    char *text;
    union {
        int ival;
    } u;
};

Then the lexer rule could look like this:

[0-9]+   { yylval->u.ival = atoi(yytext); yylval->text = string_dup(yytext);  return NUMBER; }

Parser adaptions

A simple Yacc parser grammar rule that performs only some numerical computations usually looks something like this:

EXPR:    NUMBER         { $$ = $1; }
     |   EXPR '+' EXPR  { $$ = $1 - $3; }
     ...

The semantic actions would now look more complex: it needs to access the value in a more complex way. A string_merge function would concatenate the lexemes of the different tokens. Most likely, one only accepts the additional complexity if one really needs it. For example, there are simpler options if it is only needed for debugging. Furthermore, the string_merge function should then also free the memory of the individual lexemes.

EXPR:    NUMBER         { $$ = $1; }
     |   EXPR '-' EXPR  { $$.u.ival = $1.u.ival - $3.u.ival; $$.text = string_merge($1.text, $2.text, $3.text); }
     |   EXPR '+' EXPR  { $$.u.ival = $1.u.ival + $3.u.ival; $$.text = string_merge($1.text, $2.text, $3.text); }

Freeing memory in case of parse errors

Small side note: If you use Bison and also need to free memory on parse errors, e.g. because you have a long-lived parser, as in a shell etc., you should read here: https://www.gnu.org/software/bison/manual/html_node/Destructor-Decl.html.

Complete self contained example

The following example illustrates the above points using flex and Bison.

calc.l

%{
    #include "calc.tab.h"
    #include "string_util.h"
%}

%option noyywrap noinput nounput bison-locations

%%

[\t \n]+   {  }
[0-9]+   { yylval->u.ival = atoi(yytext); yylval->text = string_dup(yytext);  return NUMBER; }
";"      { return SEMICOLON; }
.        { yylval->text = string_dup(yytext); return yytext[0]; }

%%

calc.y

To use the struct defined above, one can use the following line in Bison:

%define api.value.type {struct my_token}

see https://www.gnu.org/software/bison/manual/html_node/Value-Type.html

%locations
%define api.pure full
%define parse.error detailed
%code top {
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
}

%code requires {
    struct my_token {
        char *text;
        union {
            int ival;
        } u;
    };
}


%code {
    #include "calc.tab.h"
    #include "string_util.h"

    void yyerror(YYLTYPE* yyllocp, const char* msg);
    int yylex(YYSTYPE* yylvalp, YYLTYPE* yyllocp);

}

%define api.value.type {struct my_token}
%token NUMBER
%token SEMICOLON
%left  '-' '+'
%left  '*' '/'
%left  '(' ')'

%%
EXPR_LIST:                        { $$.u.ival = 0; }
       | EXPR_LIST EXPR SEMICOLON { printf("%d >>%s<<\n", $2.u.ival, $2.text); string_free($2.text); }
       | EXPR_LIST SEMICOLON
       ;

EXPR:    NUMBER         { $$ = $1; }
     |   EXPR '-' EXPR  { $$.u.ival = $1.u.ival - $3.u.ival; $$.text = string_merge($1.text, $2.text, $3.text); }
     |   EXPR '+' EXPR  { $$.u.ival = $1.u.ival + $3.u.ival; $$.text = string_merge($1.text, $2.text, $3.text); }
     |   EXPR '*' EXPR  { $$.u.ival = $1.u.ival * $3.u.ival; $$.text = string_merge($1.text, $2.text, $3.text); }
     |   EXPR '/' EXPR  { $$.u.ival = $1.u.ival / $3.u.ival; $$.text = string_merge($1.text, $2.text, $3.text); }
     |   '(' EXPR ')'   { $$.u.ival = $2.u.ival; $$.text = string_merge($1.text, $2.text, $3.text); }
     ;

%%

void yyerror(YYLTYPE *yyllocp, const char *str) {
    fprintf(stderr, "error: %s in line %d, column %d\n", str, yyllocp->first_line, yyllocp->first_column);
}

int main(void)
{
   yyparse();
   if(allocations != 0) {
       fprintf(stderr, "%d allocations were not freed\n", allocations);
   }
   return 0;
}

string_util.c

The variable allcations is incremented on each allocation and decremented when the memory is freed. When the end of the program is reached without parsing errors, allcations should be 0.

#include <string.h>
#include <stdlib.h>
#include "string_util.h"

int allocations = 0;

char *string_dup(char *str) {
    allocations++;
    return strdup(str);
}

void string_free(const char *ptr) {
    allocations--;
    free((void *)ptr);
}

char *string_merge(const char *str1, const char *str2, const char *str3) {
    size_t len = strlen(str1) + strlen(str2) + strlen(str3) + 1;
    char *ptr = malloc(len);
    allocations++;
    strcpy(ptr, str1);
    strcat(ptr, str2);
    strcat(ptr, str3);
    string_free(str1);
    string_free(str2);
    string_free(str3);
    return ptr;
}

string_util.h

#ifndef STRING_UTIL_H
#define STRING_UTIL_H

extern int allocations;
char *string_dup(char *str);
void string_free(const char *ptr);
char *string_merge(const char *str1, const char *str2, const char *str3);

#endif //STRING_UTIL_H

Test

With the input

 5 + 3 + 2 + 1;
 5 + 2 + 1 + 1; (5 - 4) * (3 + 1) - 2;
 5 - 4 + (1 - 2);

the program prints the following on the debug console:

11 >>5+3+2+1<<
9 >>5+2+1+1<<
2 >>(5-4)*(3+1)-2<<
0 >>5-4+(1-2)<<

So between >> and << the matching string is displayed.

How to get rule text in yacc

1 Answers1