16

This has been on my mind for a while. I'm intrigued by recursive descent parsers, and would like to know how to implement one. What I want is a simple parser that will understand simple arithmetic such as "5+5", or "(5+5)*3".

I figure the first step is to write a 'tokenizer', which takes the entire input string and breaks it into many substrings. This part I have done (I've even had to ask about it here. You don't have to follow the link if you don't want to, since I'm posting the relavent code here as well.) With this tokenizer of mine, I end up with a vector of strings, or tokens. Now, the hard part: I'd like to parse those tokens.

I've read the Wikipedia article on recursive descent parsers. I do understand the overall concept, but as always, the implementation is a bit confusing. In that article, there is a C implementation of a recursive descent parser for a very simple programming language, also discussed in the article. I studied that code as best I could, and tried to basically write the same thing, but for my program. Below is that code.

What I am really confused about is what this parser does. It seems to go through the program and 'expect' certain parts of the grammar. But once it gets there, what does it do? For example, here is one function from the Wikipedia code that is supposed to parse a 'term':

void term(void) {
    factor();
    while (sym == times || sym == slash) {
        getsym();
        factor();
    }
}

This is meant to parse this grammar:

term = factor {("*"|"/") factor} .

which makes sense. But what does it do with the actual term? Let's say that term is simply "6", or was "3*2" and came out to be of value 6. How would it incorporate that into the rest of the input? Shouldn't term() return a double instead of void (to return the 6)? Or is it done some other way?

Also, what would be the difference between getting a parser like this to output code, and to immediately act upon the input (ie compiler vs. interpreter)? Are those two (in this example at least) theoretically implemented the same way, or are they fundamentally different?

Any input is welcome. Here is my code thus far:

#include <iostream>
#include <string>
#include <vector>
#include <ctype.h>
#include <sstream>

using namespace std;

vector<string> symbolize(string);
bool accept(string);
void getSymbol();
void error(string s);
bool expect(string);
void expression();
void term();
void factor();

int currentPosition = -1;
string symbol;
vector<string> symbols;

int main(int argc, const char * argv[])
{

    string input;
    getline(cin,input);

    symbols = symbolize(input);
    getSymbol();
    expression();


    return 0;
}

void factor(){
    if(isdigit(symbol.c_str()[0])){}
    else if(accept("(")){
        expression();
        expect(")");
    }
    else {
        error("Syntax error");
    }

}

void term(){
    factor();
    while(symbol=="*"||symbol=="/"){
        getSymbol();
        factor();
    }
}

void expression(){
    if(symbol == "+" || symbol == "-") getSymbol();
    term();
    while(symbol == "+" || symbol == "-"){
        getSymbol();
        term();
    }
}

void error(string s){
    cout << endl << "ERROR: " << s << endl;
}

void getSymbol(){
    currentPosition++;
    if(currentPosition>=symbols.size())error("Unexpectedly reached end of input");

}

bool expect(string s){
    if(accept(s))return true;
    else error("Expected '" + s + "'");
    return false;
}

bool accept(string s){
    if(s==symbol){getSymbol();return true;}
    return false;
}

// Takes a string and breaks it into substrings
vector<string> symbolize(string input){
    int position = 0;
    char c;
    //stringstream s;
    vector<string> symbols;
    enum symbolType {TEXT,OPERATOR}symbolType,charType;

    while(position < input.size()){
        stringstream s;
        c = input.at(position);
        if(isalnum(c))symbolType = TEXT;
        else symbolType = OPERATOR;
        charType = symbolType;

        while(symbolType == charType){
            s << c;
            position++;
            if(position>=input.length())break;
            c = input.at(position);
            if(isspace(c)||c=='\n'){position++; break;}
            if(isalnum(c)) charType = TEXT;
            else charType = OPERATOR;
        }

        symbols.push_back(s.str());
    }

    return symbols;
}

Edit: I should mention that my code always prints: ERROR: syntax error, from the factor() function.

Community
  • 1
  • 1
  • the demo code doesnt do anything with the input it reads. You have to add that part yourself. – Mooing Duck Apr 30 '12 at 05:39
  • usually, an interpretter will perform the multiplication right after it reads the second number and return the result. A compiler will create a "multiply" object, and give it the objects returned by the left and right sides, and return (or store) that object. – Mooing Duck Apr 30 '12 at 05:40
  • @MooingDuck But how does it know that it's on the second result, when it's the same function running recursively? –  Apr 30 '12 at 05:42
  • @Hassen term makes two calls to factor (kas your code shows), and stores the two results. One is the first, one is the second. The fact that term might be called again recursively is irrelevant. – Mooing Duck Apr 30 '12 at 05:44
  • @MooingDuck So actually multiplying or dividing the numbers would be done in the while loop? –  Apr 30 '12 at 05:46
  • yes, an both ways have that logic in the while loop. Someone should plagerize this info into an answer so it can be voted on, it's hard for me to write proper answers on my phone. – Mooing Duck Apr 30 '12 at 05:55
  • See http://stackoverflow.com/a/25106688/120163. It shows an example that builds trees, but you could replace tree-construction with doing arithmetic, thereby producing a computed answer by the time the parse completes. – Ira Baxter Aug 15 '14 at 04:55

1 Answers1

6

The wikipedia article contains a very complete looking parser (but no lexer!) that does nothing else.

For actually interpreting the result, the general idea is, each parsing function passes back the partly-interpreted result to its parent/caller. The result could be of a different type for each rule in the tree. If you are writing a parser for a complete language, a partially interpreted result could be a simple constant (which could be of any type) or it could be an entire function (which you may need to later compile). In the case of an equation parser, each rule would simply perform the required operation on the elements it gets from calling other functions, and pass the result back up to the function that called it.

There are two approaches that come to mind:

  1. Have each function accept a something* result parameter. In the case of a simple equation parser this would probably be float* result for all of the elements.

  2. Simply return the result, by changing all the functions from void rule_x()... to float rule_x()....

In either case you will need some way to deal with errors. If you are in C you have no exceptions so you would probably be best using option 1 and using the return value to indicate success. Then there would be lots of

if(!expression(&result)) return 0;

But in C++ you could wrap the parse in an exception handler, and throw an exception on error that aborted the rest of the parse.

Things get much more interesting when you want to, say compile an entire language with optimisation or JIT, and attempt to recover gracefully from syntax errors and keep parsing.

The book to get on the subject is the dragon book.

Michael Slade
  • 13,802
  • 2
  • 39
  • 44