This has been on my mind for a while. I'm intrigued by recursive descent parsers, and would like to know how to implement one. What I want is a simple parser that will understand simple arithmetic such as "5+5", or "(5+5)*3".
I figure the first step is to write a 'tokenizer', which takes the entire input string and breaks it into many substrings. This part I have done (I've even had to ask about it here. You don't have to follow the link if you don't want to, since I'm posting the relavent code here as well.) With this tokenizer of mine, I end up with a vector
of string
s, or tokens. Now, the hard part: I'd like to parse those tokens.
I've read the Wikipedia article on recursive descent parsers. I do understand the overall concept, but as always, the implementation is a bit confusing. In that article, there is a C implementation of a recursive descent parser for a very simple programming language, also discussed in the article. I studied that code as best I could, and tried to basically write the same thing, but for my program. Below is that code.
What I am really confused about is what this parser does. It seems to go through the program and 'expect' certain parts of the grammar. But once it gets there, what does it do? For example, here is one function from the Wikipedia code that is supposed to parse a 'term':
void term(void) {
factor();
while (sym == times || sym == slash) {
getsym();
factor();
}
}
This is meant to parse this grammar:
term = factor {("*"|"/") factor} .
which makes sense. But what does it do with the actual term? Let's say that term is simply "6", or was "3*2" and came out to be of value 6. How would it incorporate that into the rest of the input? Shouldn't term()
return a double
instead of void
(to return the 6)? Or is it done some other way?
Also, what would be the difference between getting a parser like this to output code, and to immediately act upon the input (ie compiler vs. interpreter)? Are those two (in this example at least) theoretically implemented the same way, or are they fundamentally different?
Any input is welcome. Here is my code thus far:
#include <iostream>
#include <string>
#include <vector>
#include <ctype.h>
#include <sstream>
using namespace std;
vector<string> symbolize(string);
bool accept(string);
void getSymbol();
void error(string s);
bool expect(string);
void expression();
void term();
void factor();
int currentPosition = -1;
string symbol;
vector<string> symbols;
int main(int argc, const char * argv[])
{
string input;
getline(cin,input);
symbols = symbolize(input);
getSymbol();
expression();
return 0;
}
void factor(){
if(isdigit(symbol.c_str()[0])){}
else if(accept("(")){
expression();
expect(")");
}
else {
error("Syntax error");
}
}
void term(){
factor();
while(symbol=="*"||symbol=="/"){
getSymbol();
factor();
}
}
void expression(){
if(symbol == "+" || symbol == "-") getSymbol();
term();
while(symbol == "+" || symbol == "-"){
getSymbol();
term();
}
}
void error(string s){
cout << endl << "ERROR: " << s << endl;
}
void getSymbol(){
currentPosition++;
if(currentPosition>=symbols.size())error("Unexpectedly reached end of input");
}
bool expect(string s){
if(accept(s))return true;
else error("Expected '" + s + "'");
return false;
}
bool accept(string s){
if(s==symbol){getSymbol();return true;}
return false;
}
// Takes a string and breaks it into substrings
vector<string> symbolize(string input){
int position = 0;
char c;
//stringstream s;
vector<string> symbols;
enum symbolType {TEXT,OPERATOR}symbolType,charType;
while(position < input.size()){
stringstream s;
c = input.at(position);
if(isalnum(c))symbolType = TEXT;
else symbolType = OPERATOR;
charType = symbolType;
while(symbolType == charType){
s << c;
position++;
if(position>=input.length())break;
c = input.at(position);
if(isspace(c)||c=='\n'){position++; break;}
if(isalnum(c)) charType = TEXT;
else charType = OPERATOR;
}
symbols.push_back(s.str());
}
return symbols;
}
Edit: I should mention that my code always prints: ERROR: syntax error
, from the factor()
function.