Non-regex alternative to word boundaries

Question

I am currently writing a lexer using regular expressions as described in this post: Poor man's "lexer" for C#

While it was much faster than what I already had, I just didn't like how things still took roughly 500ms per file (timed in a loop of 100x36k tokens with Stopwatch).

After moving around the precedence of my tokens, I cut the 500ms in half already and I gained an additional 50ms (roughly) by adding a "simple match" boolean to most of my tokens (which basically means it should use a simple string.Contains(Ordinal) rather than Regex.Match).

For best performance, I obviously want to get rid of most, if not all Regex.Match calls. For that to be possible, I need something which simulates the \b tag in Regex, otherwise known as a word boundary (meaning it should only match the whole word).

While I can go wild and write a simple method which checks if the character before and after my "simple match" is a non-word character, I was wondering if .NET would have something for this built-in?

If I would end up having to write my own method, what would be the best approach? Pick the index of the character after my word and check if it's byte value is lower than whatever? Any tips regarding this would also be welcome!

Note that `\b` matches a word boundary only if your words consist of letters, digits and underscores. For all other cases it's useless. — Joey, Oct 10 '13 at 21:12
I am aware, but thanks for the heads-up! As long as I simulate the exact behavior of `\b` my lexer will produce the exact output I expect. — Lennard Fonteijn, Oct 10 '13 at 21:13
Are you precompiling your regex? That could speed up regex matching. Also, [this question](http://stackoverflow.com/questions/5039512/regex-match-in-c-sharp-performance-problem) might provide some insight as to why your regular expressions might be taking longer to process. Without seeing your expressions it's hard to know. — Brandon Buck, Oct 10 '13 at 21:29
I am parsing with regex and getting better performance than that. I second the are you precompiling the regex? — paparazzo, Oct 10 '13 at 21:47
The `Compiled` flag has absolutely zero effect on the performance. I went ahead and implemented a `\b` alternative as I described I would in the question. Which leaves only 4 expressions, I doubt a `CompileToAssembly` would gain me much performance now that I'm already down to 120ms (from the original 500ms). — Lennard Fonteijn, Oct 10 '13 at 22:17
According to the profiler, `Regex.Match` still takes roughly 39% of the time with just 200k calls, where as my `MatchSimple` (the Regex alternative), recieves 2 million calls and still takes less time. So just those 4 regexes, which is nothing more than `^(\w?[\w/.]+)` and 3 more for floats/integers and strings, that's quite a lot... — Lennard Fonteijn, Oct 10 '13 at 22:19

Lennard Fonteijn · Accepted Answer · 2013-10-11T22:49:03.993

Not sure why my initial question is being downvoted as to me it seems rather clear. I was not after getting my Regex' fixed as profiling showed that even the simplest Regex still takes more than I want. It may be a poor mans lexer, but I still want it to perform as best as possible.

The question, however, was if .NET had an alternative to word-boundaries built-in and if not, how I would go about implementing it myself WITHOUT using Regex.

The answer to the first question appears to be No.

As for the second, I wrote an Extension-method for the char class:

public static bool IsWordCharacter(this char character)
{
    return (
        (character >= 'a' && character <= 'z') || 
        (character >= 'A' && character <= 'Z') || 
        (character >= '0' && character <= '9') || 
        character == '_');
}

According to most Regex documentation, this mimics the \w flag (negating this method with ! results in \W obviously), which in return is used in \b, but without matching it in the result.

I then use this in a method something like this:

return 
    text.StartsWith(<needle>, StringComparison.Ordinal) 
    && !text[<length of needle>].IsWordCharacter()
        ? <length of needle> 
        : 0;

After which my underlying code knows if it has to use or drop the token.

Disclaimer: I'm aware it's not a full implementation of \b, but it serves my purpose.

Also, after having converted all my Regex' in this way, I went from 250ms to a mere 50ms for exactly the same file. Lexing all the 110 script files I have to my possession takes less than a second in total, averaging to roughly 7ms per file.

Non-regex alternative to word boundaries

1 Answers1