I'm trying to write a scanner for a C/C++/C#/Java/D-like programming language that I'm designing for personal reasons, and I'm using Ragel to generate it. I'm having trouble understanding exactly when many of the operators trigger their actions, probably because my academics focused on practical knowledge rather than theory, and a great deal of this deterministic/non-deterministic finite automata business goes right over my head. Either the documentation is lacking or my understanding of it is. I'm assuming the latter.
In any case, I'm working my way up from the basics. I've identified several keywords and special characters in my first iteration. Now I've run into the issue where all keywords are being scanned as identifiers. I'm using the scanner operator for all of my keywords, as that resolved my issue of the string "returns" being scanned as both the "return" and "returns" keywords.
How can I properly scan for identifiers? I understand that to make this deterministic, I need to effectively specify that a lexeme can only be an identifier if it matches no other token's pattern. Forgive my lack of knowledge.
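To make the rule I'm after concrete, here is a minimal hand-written C++ sketch of the intended behaviour. It is not Ragel and not my generated scanner: it takes the longest identifier-shaped lexeme first, then treats it as a keyword only if the whole lexeme is in a keyword table. The names `kKeywords`, `identifierLength` and `classify` are hypothetical, and the table holds only two of my keywords.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical keyword table for illustration; only "return" and "returns"
// are taken from my grammar, and the real list is longer.
static const std::unordered_set<std::string> kKeywords = {"return", "returns"};

static bool isIdentStart(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
}

static bool isIdentPart(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Length of the longest identifier-shaped lexeme starting at pos,
// or 0 if the character at pos cannot start one.
std::size_t identifierLength(const std::string& input, std::size_t pos) {
    assert(pos <= input.size());
    if (pos == input.size() || !isIdentStart(input[pos])) return 0;
    std::size_t end = pos + 1;
    while (end < input.size() && isIdentPart(input[end])) ++end;
    return end - pos;
}

// A complete lexeme is a keyword only if it matches a table entry exactly;
// anything else with identifier shape falls through to "identifier".
std::string classify(const std::string& lexeme) {
    return kKeywords.count(lexeme) ? "keyword" : "identifier";
}
```

With this rule, "returns" is consumed whole and classified once as a keyword, while "returned" is an identifier even though it starts with "return". That is the behaviour I would like the Ragel scanner to reproduce.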
Ragel Script:
%%{
    Identifier = (alpha | '_') . (alnum | '_')*;

    action IdentifierAction
    {
        std::cout << "identifier(\"";
        std::cout.write(ts, te - ts);
        std::cout << "\")";
    }
}%%
%%{
    main :=
        |*
            Interface => InterfaceAction;
            Class => ClassAction;
            Property => PropertyAction;
            Function => FunctionAction;
            TypeQualifier => TypeQualifierAction;
            OpenParenthesis => OpenParenthesisAction;
            CloseParenthesis => CloseParenthesisAction;
            OpenBracket => OpenBracketAction;
            CloseBracket => CloseBracketAction;
            OpenBrace => OpenBraceAction;
            CloseBrace => CloseBraceAction;
            Semicolon => SemicolonAction;
            Returns => ReturnsAction;
            Return => ReturnAction;
            Identifier => IdentifierAction;
            space+;
        *|;
}%%