I am currently writing a lexer using regular expressions as described in this post: Poor man's "lexer" for C#
While it was much faster than what I already had, I just didn't like that things still took roughly 500 ms per file (timed with Stopwatch over a loop of 100 runs on ~36k tokens).
After moving around the precedence of my tokens, I already cut the 500 ms in half, and I gained an additional 50 ms (roughly) by adding a "simple match" boolean to most of my tokens, which basically means the lexer uses a simple ordinal `string.Contains` rather than `Regex.Match` for them.
For best performance, I obviously want to get rid of most, if not all, `Regex.Match` calls. For that to be possible, I need something that simulates the `\b` anchor in Regex, otherwise known as a word boundary (meaning it should only match the whole word).
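To make the problem concrete, here is what `\b` buys me over a plain ordinal search (made-up input for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

class BoundaryDemo
{
    static void Main()
    {
        string input = "printable int x;";

        // Plain ordinal search: finds "int" inside "printable" -- a false match.
        Console.WriteLine(input.IndexOf("int", StringComparison.Ordinal)); // 2

        // \b-anchored regex: skips "printable" and matches the whole word.
        Console.WriteLine(Regex.Match(input, @"\bint\b").Index);           // 10
    }
}
```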
While I could go wild and write a simple method that checks whether the characters before and after my "simple match" are non-word characters, I was wondering whether .NET has something built in for this?
If I do end up having to write my own method, what would be the best approach? Pick the character after my word and check whether its value is below some cutoff? Any tips regarding this would also be welcome!
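For reference, this is roughly the kind of helper I have in mind; it treats letters, digits and underscore as word characters, which approximates (but does not exactly match) Regex's `\w`:

```csharp
using System;

static class WordBoundary
{
    // Approximates \w: letters and digits per Unicode, plus underscore.
    // (Regex's \w also includes connector punctuation and non-spacing marks.)
    static bool IsWordChar(char c)
    {
        return char.IsLetterOrDigit(c) || c == '_';
    }

    // True if the match at input[index .. index + length) is not flanked
    // by word characters, i.e. both ends sit on a \b-style boundary.
    public static bool IsWholeWord(string input, int index, int length)
    {
        bool boundaryBefore = index == 0 || !IsWordChar(input[index - 1]);
        bool boundaryAfter = index + length >= input.Length
                             || !IsWordChar(input[index + length]);
        return boundaryBefore && boundaryAfter;
    }
}
```

The idea would be to call `IsWholeWord(input, pos, literal.Length)` after every `IndexOf` hit and, if it fails, retry the search from `pos + 1`.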