Complicated Regex

Question

I need some help here. I'm trying to write a few regular expressions to catch the word int, any mathematical operators, any digits and = signs in my code, while ignoring everything else. The words which will be ignored will be set to false, while the others are set to true, as shown in the code below.

This will be used to tokenize the above-mentioned keywords in order to implement a lexer which can detect integer overflows. I need this done exclusively with Regex.

I've already successfully captured the word int, mathematical operators and digits, but my Regex can't seem to recognize any random words, such as variable names (number1, number2, etc.) and anything else in the language, such as if statements, round brackets, curly brackets, etc.

        lexer.AddDefinition(new TokenDefinition(
            "(operator)",
            new Regex(@"\*|\/|\+|\-"),
            false));

        lexer.AddDefinition(new TokenDefinition(
            "(literal)",
            new Regex(@"\d+"),
            false));

        lexer.AddDefinition(new TokenDefinition(
            "(Random Word)",
            new Regex(@"(?=.*[A-Z])(?=.*[a-z])"),
            false));

        lexer.AddDefinition(new TokenDefinition(
            "(integer)",
            new Regex(@"\bint\b"),
            false));

        lexer.AddDefinition(new TokenDefinition(
            "(white-space)",
            new Regex(@"\s+"),
            true));


        // This is not working. Random words such as variable names are not being captured by this.
        lexer.AddDefinition(new TokenDefinition(
            "(random-word)",
            new Regex(@"\b(?=.*[A-Z])(?=.*[a-z])\b"),
            true));

        // What about the brackets? How can I implement a Regex to capture brackets?

This seems to be so simple but I can't get it done. Please share your views, any opinions are welcome.

Have you thought about using an LL(1) Parser-generator like coco/r or similar? http://www.thefreecountry.com/programming/compilerconstruction.shtml — TGlatzer, Dec 23 '14 at 12:59

asontu · Accepted Answer · 2014-12-23T12:50:06.767

Both \b and (?=...) are zero-width assertions. In other words, they don't consume any characters; they just assert a condition, and the match fails at that position if the condition isn't met. A pattern built only from them can at best match an empty string.

I'm unsure what exactly you mean by "random words", but going with what variables look like in C# I would do this:
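To make the zero-width point concrete, here is a minimal sketch that runs the pattern from the question against two inputs. The class and method names (`ZeroWidthDemo`, `FirstMatchLength`) are made up for illustration and are not part of the lexer in the question.

```csharp
using System.Text.RegularExpressions;

public static class ZeroWidthDemo
{
    // The pattern from the question: two word boundaries and two lookaheads,
    // none of which consumes a character.
    private static readonly Regex AssertionsOnly =
        new Regex(@"\b(?=.*[A-Z])(?=.*[a-z])\b");

    // Returns the length of the first match, or -1 when nothing matches.
    public static int FirstMatchLength(string input)
    {
        Match m = AssertionsOnly.Match(input);
        return m.Success ? m.Length : -1;
    }
}
```

With this sketch, `ZeroWidthDemo.FirstMatchLength("Number1")` returns 0: the match succeeds but is empty, so there is no text to turn into a token. For `"number1"` it returns -1, because the first lookahead requires an uppercase letter somewhere ahead in the input.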

\b[a-zA-Z_]\w*\b

This matches a word boundary, then a letter or underscore, followed by 0 or more word characters (letters, digits or underscores), and ends with a word boundary.
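A quick way to see what this pattern picks up is to list every match in a sample line. This is only a sketch; `IdentifierDemo` and `FindIdentifiers` are hypothetical names, and the sample input is invented.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IdentifierDemo
{
    // Word boundary, a letter or underscore, then any number of word characters.
    private static readonly Regex Identifier = new Regex(@"\b[a-zA-Z_]\w*\b");

    // Collects the text of every identifier-like match in the source.
    public static List<string> FindIdentifiers(string source)
    {
        var found = new List<string>();
        foreach (Match m in Identifier.Matches(source))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Calling `IdentifierDemo.FindIdentifiers("int number1 = 42 + x_2;")` yields `int`, `number1` and `x_2`. The literal `42` is skipped because it does not start with a letter or underscore, and `int` is still included at this stage.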

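Small update after the comments: this version won't give issues with non-ASCII characters and won't match the keyword int, which is already handled by the other TokenDefinition. The \b inside the lookahead restricts the exclusion to the whole word int, so identifiers that merely start with it, such as integer1, still match.

\b(?!int\b)[a-zA-Z_][a-zA-Z0-9_]*\b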
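The negative lookahead can be written as `(?!int)` or as `(?!int\b)`, and the two behave differently on identifiers that begin with the letters int. The sketch below compares them; `LookaheadDemo` and its members are illustrative names only, and the sample input is made up.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Rejects anything that starts with the letters "int".
    public const string PrefixLookahead = @"\b(?!int)[a-zA-Z_][a-zA-Z0-9_]*\b";

    // Rejects only the whole word "int".
    public const string WholeWordLookahead = @"\b(?!int\b)[a-zA-Z_][a-zA-Z0-9_]*\b";

    // Collects the text of every match of the given pattern in the source.
    public static List<string> FindAll(string pattern, string source)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(source, pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

For the input `"int integer1 = interval + x"`, `PrefixLookahead` finds only `x`, while `WholeWordLookahead` finds `integer1`, `interval` and `x`. Both skip the keyword `int` itself.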

edited Dec 23 '14 at 12:50

answered Dec 23 '14 at 12:44


That's very good, but note that `\w` also matches non-ASCII letters/digits, which might not be desired. – Tim Pietzcker Dec 23 '14 at 12:46
I apologize, "random words" is out of context there. What I meant is anything other than the keyword int, + * - / =, and any digit. Practically anything other than what has to do with arithmetic and integers. – ClaireG Dec 23 '14 at 12:46
Basically I have a method inside TokenDefinition which will match a string source and point out where there is a word 'int', the math operators, an equals sign or any digit. The rest needs to be ignored; however, it seems that I still need to match the "other words" with regex, in order to catch them with an exception. – ClaireG Dec 23 '14 at 12:49

Depending on what you're gonna do with the found matches, you might wanna look at **[this answer](http://stackoverflow.com/a/25402157/2684660)** to see how to make sure your `literal` TokenDefinition won't match the `123` in `var123 = 456` etc. Generally I would advise against using multiple small regexes; instead, use one bigger regex and analyze the result it gives back to determine what to do with each match. – asontu Dec 23 '14 at 12:55
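One possible shape for the "one bigger regex" idea is an alternation of named groups, where the group that matched tells you the token kind. This is a rough sketch under my own assumptions, not the approach from the linked answer: the group names, the `bracket` and `other` categories, and the `SingleRegexLexer` class are all invented for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SingleRegexLexer
{
    // Alternatives are tried left to right, so the keyword is listed before
    // the identifier, and the identifier before the literal.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<integer>\bint\b)" +
        @"|(?<identifier>\b[a-zA-Z_][a-zA-Z0-9_]*\b)" +
        @"|(?<literal>\b[0-9]+\b)" +
        @"|(?<operator>[*/+\-=])" +
        @"|(?<bracket>[(){}\[\]])" +
        @"|(?<whitespace>\s+)" +
        @"|(?<other>.)",
        RegexOptions.ExplicitCapture);

    private static readonly string[] Kinds =
        { "integer", "identifier", "literal", "operator", "bracket", "whitespace", "other" };

    // Returns tokens as "kind:text" strings, dropping whitespace.
    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(source))
        {
            foreach (string kind in Kinds)
            {
                if (!m.Groups[kind].Success) continue;
                if (kind != "whitespace") tokens.Add(kind + ":" + m.Value);
                break;
            }
        }
        return tokens;
    }
}
```

With this sketch, `SingleRegexLexer.Tokenize("var123 = 456")` gives `identifier:var123`, `operator:=` and `literal:456`, so the digits inside `var123` are never reported as a literal. Brackets get their own group through a character class, and anything unrecognized falls into `other`, which the caller could ignore or report as an error.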
