Simple lexical parser

Question

I want to write a lexical parser for regular text, so I need to detect the following tokens:

1) Word 2) Number 3) dot and other punctuation 4) "..." "!?" "!!!" and so on

I think it is not trivial to write an "if else" condition for each item. So are there any finite state machine generators for C#? I know about ANTLR and others, but in the time it would take me to learn how to work with these tools I could write my own "if else" FSM.

I hope to find something like:

FiniteStateMachine.AddTokenDefinition(":)","smile");
FiniteStateMachine.AddTokenDefinition(".","dot");
FiniteStateMachine.ParseText(text);

Could you give us a few samples of the text stream that you envisage you might want to parse. This will help decide what direction you should go. — kingchris, Jun 10 '12 at 15:47
@kingchris Usual text, like an article on codebetter, my question, a tweet... — Neir0, Jun 10 '12 at 15:49
I suggest using Regular Expressions. Something like "[a-zA-Z\\-]+" will pick up words (a-z and dashes), while "[0-9]*(\\.[0-9]+)?" will pick up numbers (including decimal numbers). Dots and such are similar - "[!\\.\\?]+" - and you can just add whatever punctuation you need inside the square brackets (escaping special Regex characters with a \\, which evaluates to a single \ after C# string escaping). Let me know if this seems interesting; if so, I'll turn it into a full answer. Check out a [Microsoft Tutorial on Regex](http://msdn.microsoft.com/en-us/library/ms228595%28v=VS.80%29.aspx) — GGulati, Jun 10 '12 at 15:53
@GGulati - When writing regex in C#, it's usually good to use [verbatim strings](http://www.c-sharpcorner.com/uploadfile/harishankar2005/verbatim_literals11262005010742am/verbatim_literals.aspx): `@"[!\.\?]"`, for example. The @ at the beginning means you don't have to escape backslashes. — Justin Morgan - On strike, Jun 10 '12 at 15:56
I'm aware. Sometimes, however, you want to enter in a escaped character (new lines, for example). I've written a Regex lexer in the past for a simple DSL. But yes, in this case it would clean up the expressions quite a bit. Also, Neir, check out [this SO question](http://stackoverflow.com/questions/673113/poor-mans-lexer-for-c-sharp). It's very similar to what you want. — GGulati, Jun 10 '12 at 15:59
@GGulati I don't think regular expressions are good for such a task, for three reasons: 1) they are slow; 2) it's very hard to preserve token order; 3) how do you detect "." and "...", ":)" and ":))", etc.? — Neir0, Jun 10 '12 at 16:02
Well, speed is always an issue when dealing with strings. I don't know how much of an object it is, but I would want to profile my options first if it was. As for saving token order... I wrote a simple DSL last fall; it included connecting a token with a line number, column number, and character number for easier debugging (ex: Syntax error on line 47, column 4: forgot an opening parenthesis). I am of the opinion that this is quite feasible (assuming this is what you mean by token order, of course). — GGulati, Jun 10 '12 at 16:07
The last concern is what regex is excellent at. Let's say you have @"[\.:)]+". It'll match any combination of "." ":" and ")" - including "...", ":)" and ":))" (as well as ".:" and ")."). It's greedy, so it'll match ":))" as a single ":))" token rather than as ":)" and ")". — GGulati, Jun 10 '12 at 16:08
@GGulati Well, yeah, you are right. I looked at the answer at your link. It is exactly what I want. Thank you. Sorry, but I can't accept a comment. — Neir0, Jun 10 '12 at 16:13
@Neir0 - Regex by itself isn't up to the task, but it can be (and often is) used as part of a larger solution that takes things like context and group-balancing into account. Considering how variable your tokens are, I think a regex component makes sense here. It's very good at matching things that are highly variable within a given set of rules. — Justin Morgan - On strike, Jun 10 '12 at 16:19

score 3 · Accepted Answer · edited Jun 16 '18 at 15:02

I suggest using Regular Expressions. Something like @"[a-zA-Z\-]+" will pick up words (a-z and dashes), while @"[0-9]*(\.[0-9]+)?" will pick up numbers (including decimal numbers). Dots and such are similar - @"[!\.\?]+" - and you can just add whatever punctuation you need inside the square brackets (escaping special Regex characters with a \).

[Poor man's "lexer" for C#](http://stackoverflow.com/questions/673113/poor-mans-lexer-for-c-sharp) is very close to what you are looking for, in terms of being a lexer. I recommend googling regular expressions for words, numbers, or whatever else you need, to find out exactly which expressions you need.

EDIT:

Or see Justin's answer for the particular regexes.
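As a rough illustration of how the patterns above could be combined while keeping tokens in their original order, here is a minimal sketch. The class name `SimpleLexer`, the method `Tokenize`, and the group names are assumptions made for this example, not part of any library. The number pattern uses `[0-9]+` instead of `[0-9]*`, a deliberate change so that it can never match an empty string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; names are not from any existing library.
public static class SimpleLexer
{
    // One alternation with a named group per token kind.
    // "number" is listed first and requires at least one digit,
    // so "3.14" stays one token and no empty matches are produced.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<number>[0-9]+(?:\.[0-9]+)?)|(?<word>[a-zA-Z\-]+)|(?<punct>[!\.\?]+)");

    // Returns (kind, text) pairs in the order they appear in the input.
    // Characters that match no group (spaces, for example) are skipped.
    public static List<KeyValuePair<string, string>> Tokenize(string text)
    {
        var tokens = new List<KeyValuePair<string, string>>();
        foreach (Match m in TokenPattern.Matches(text))
        {
            string kind = m.Groups["number"].Success ? "number"
                        : m.Groups["word"].Success ? "word"
                        : "punct";
            tokens.Add(new KeyValuePair<string, string>(kind, m.Value));
        }
        return tokens;
    }
}
```

Because `Regex.Matches` scans left to right, the returned list preserves token order, and the greedy `+` keeps runs such as "..." or "!?" together as one token.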

Justin Morgan - On strike · Answer 2 · 2012-06-10T16:26:33.090

We need to know specifics on what you consider a word or a number. That being said, I'll assume "word" means "a C#-style identifier," and "number" means "a string of base-10 numerals, possibly including (but not starting or ending with) a decimal point."

Under those definitions, words would be anything matching the following regex:

@"\b(?!\d)\w+\b"

Note that in .NET `\w` also matches Unicode word characters, not just ASCII. Numbers would match the following:

@"\b\d+(?:\.\d+)?\b"

Note again that this doesn't cover hexadecimal, octal, or scientific notation, although you could add that in without too much difficulty. It also doesn't cover numeric literal suffixes.

After matching those, you could probably get away with this for punctuation:

@"[^\w\d\s]+"
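One possible way to use these three patterns together is a small table-driven lexer in the spirit of the `AddTokenDefinition` / `ParseText` calls from the question. This is only a sketch: `RegexLexer` and its members are hypothetical names, and the approach (trying each definition in registration order at the current position, anchored with `\G`) is one design choice among several.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical table-driven lexer; class and method names are illustrative only.
public class RegexLexer
{
    private readonly List<KeyValuePair<string, Regex>> definitions =
        new List<KeyValuePair<string, Regex>>();

    // \G forces each pattern to match exactly at the current scan position.
    public void AddTokenDefinition(string pattern, string name)
    {
        definitions.Add(new KeyValuePair<string, Regex>(
            name, new Regex(@"\G(?:" + pattern + ")")));
    }

    // Returns "name:text" strings in input order.
    public List<string> ParseText(string text)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            bool matched = false;
            foreach (var def in definitions)
            {
                // Match(text, pos) starts at pos but still sees the whole
                // string, so \b behaves correctly at token edges.
                Match m = def.Value.Match(text, pos);
                if (m.Success && m.Length > 0)
                {
                    tokens.Add(def.Key + ":" + m.Value);
                    pos += m.Length;
                    matched = true;
                    break; // first definition that matches wins
                }
            }
            if (!matched) pos++; // skip whitespace or anything no definition covers
        }
        return tokens;
    }
}
```

A usage sketch with the three patterns above:

```csharp
var lexer = new RegexLexer();
lexer.AddTokenDefinition(@"\b(?!\d)\w+\b", "word");
lexer.AddTokenDefinition(@"\b\d+(?:\.\d+)?\b", "number");
lexer.AddTokenDefinition(@"[^\w\d\s]+", "punct");
List<string> tokens = lexer.ParseText("Wait... 42 items :))");
```

Since definitions are tried in the order they were added, more specific tokens should be registered before more general ones.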
