Match source code keywords

Question

I want to match keywords of c# source code with regular expression. Let's say I have "new" keyword. I want to match all "new" keywords that are not inside " ", // (comment) and /* */ (second comment)

I've written so far:

\b[^\w@]new\b

However it's not working for:

new[]
var a = new[] { "bla" };
var string = "new"
foo(); // new
/* new */

How can I improve that regex?

What are you really trying to do? Why do you need to match C# keywords? — Austin Salonen, Jul 11 '13 at 19:47
You should probably use Roslyn. Parsing source code with a regex is a recipe for frustration. — SLaks, Jul 11 '13 at 19:48
I want to create syntax highlight (as html) for source code - I need to decorate keywords with element. I don't want to make a perfect algorithm, but I want to make it by myself (for educational purposes) — , Jul 11 '13 at 19:51
parsing source code is essentially equivalent in complexity to parsing HTML, and we all know how that ends... http://stackoverflow.com/a/1732454/1100441 — Jesse Smith, Jul 11 '13 at 19:52
You should be using a parser for this. That's also a great educational opportunity! — Jesse Smith, Jul 11 '13 at 19:53
Ok, but what about generic highlighting? Let's say that I don't want to handle c# only. So foreach language, I'll need to use different parsers? — , Jul 11 '13 at 19:58
Hmm, I've found sth: http://colorcode.codeplex.com/. I'll try to figure out how it is working. — , Jul 11 '13 at 20:05
@pwas If you actually want valid highlighting, then yes, you need different parsers for different languages. Different languages have different keywords, and just all around different contexts for syntax highlighting. A one size fits all solution is doomed at failure from the start. — Servy, Jul 11 '13 at 20:18
Yeah, I know. But I didn't want to install 3rd parts assemblies for each language. I rather thought about custom mechanism and interface that I could implement for every lanaguage that I want to handle. Anyway, thanks for hints. — , Jul 11 '13 at 20:31
It gets even more complicated if you want to "colorize" code that isn't quite valid, as you would in an editor. — Tom Blodget, Jul 12 '13 at 00:08
You may want to look at the source for [Google Code Prettify](http://code.google.com/p/google-code-prettify/) for some ideas on what to match, it is what stackoverflow uses for it's code highlighting. — Scott Chamberlain, Jul 12 '13 at 01:20

Ro Yo Mi · Answer 1 · 2013-07-12T01:52:40.363

Description

It would be easier to capture all the undesirable matches and all the good things. Then later in programming logic test to see if a capture group was populated, if so then it is a match you want.

This expression will:

avoid all single and double quoted blocks of text like "new" or 'new'
avoid all block commented sections like /* new */
avoid all single line comments // new
any non quoted or commented keywords like new, var, and foo

(\/\*(?:(?!\*\/)|.)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|'[^']*')|(new|var|foo)|(\w+)

enter image description here

Example

I don't know c#, so I offer a powershell example to demonstrate how I would accomplish this. I made the expression case insensitve and turned on "dot matches new lines" by using (?is) and had to escape all the single quotes in the expression as ''.

Code

$String = 'NEW[]
var a = NEw[] { "bla" };
var string = "new"
foo(); // new
/*
new
*/
'
clear

[regex]$Regex = '(?is)(\/\*(?:(?!\*\/)|.)*\*\/|\/{2}[^\r\n]*[\r\n]+)|("[^"]*"|''[^'']*'')|(new|var|foo)|(\w+)'

# cycle through all matches
$Regex.matches($String) | foreach {

    # Capture group 1 collects the comments, if populated then this match is a comment
    if ($_.Groups[1].Value) {
        Write-Host "comment at " $_.Groups[1].index " with a value => " $_.Groups[1].Value
        } # end if

    # capture group 2 collects the quoted strings, if populated then this match is a quoted string
    if ($_.Groups[2].Value) {
        Write-Host "quoted string at " $_.Groups[2].index " with a value => " $_.Groups[2].Value
        } # end if

    # capture group 3 collects keywords like new, var, and foo, if populated then this match is a keyword
    if ($_.Groups[3].Value) {
        Write-Host "keyword at " $_.Groups[3].index " with a value => " $_.Groups[3].Value
        } # end if

    # capture group 4 collects all the other word character chunks, so these might be variable names
    if ($_.Groups[4].Value) {
        Write-Host "possible variable name at " $_.Groups[4].index " with a value => " $_.Groups[4].Value
        } # end if

    } # next match

Output

keyword at  0  with a value =>  NEW
keyword at  7  with a value =>  var
possible variable name at  11  with a value =>  a
keyword at  15  with a value =>  NEw
quoted string at  23  with a value =>  "bla"
keyword at  33  with a value =>  var
possible variable name at  37  with a value =>  string
quoted string at  46  with a value =>  "new"
keyword at  53  with a value =>  foo
comment at  60  with a value =>  // new

comment at  68  with a value =>  /*
new
*/

score 1 · Accepted Answer · answered Jul 12 '13 at 03:32

Simple, use a lexer. A lexer finds groups of text in a string and generates tokens from those groups. The tokens are then provided with a "type". (Something to define what it is)

A C# keyword is one of the defined C# keywords. A simple Regular expression for this would define borders followed by one of the possible C# keywords. ("\b(new|var|string|...)\b")

Your lexer will find all of the matches in a given string for keywords, create a token for each match, and say that the token "type" is "keyword".

However, like you say, you do not want to find keywords inside quotes, or comments. This is where the lexer really earns its points.

To resolve this case a (regex-based)lexer would use two methods:

Remove all of the matches contained by another match.
Remove a match that uses the same space as another but has a lower priority.

A lexer works in the following steps:

Find all of the matches from the regexes
Convert them to tokens
Order the tokens by index
Loop through each of the tokens comparing the current match with the next match, if the next match is partially contained by this match (or if they both occupy the same space) remove it.

Spoiler Alert Below is a fully functional lexer. It will demonstrate how a lexer works, because it is a fully functional lexer.

For Example:

Given regexes for strings, comments, and keywords, show how a lexer resolves conflicts between them.

//Simple Regex for strings
string StringRegex = "\"(?:[^\"\\\\]|\\\\.)*\"";

//Simple Regex for comments
string CommentRegex = @"//.*|/\*[\s\S]*\*/";

//Simple Regex for keywords
string KeywordRegex = @"\b(?:new|var|string)\b";

//Create a dictionary relating token types to regexes
Dictionary<string, string> Regexes = new Dictionary<string, string>()
{
    {"String", StringRegex},
    {"Comment", CommentRegex},
    {"Keyword", KeywordRegex}
};

//Define a string to tokenize
string input = "string myString = \"Hi! this is my new string!\"//Defines a new string.";


//Lexer steps:
//1). Find all of the matches from the regexes
//2). Convert them to tokens
//3). Order the tokens by index then priority
//4). Loop through each of the tokens comparing
//    the current match with the next match,
//    if the next match is partially contained by this match
//    (or if they both occupy the same space) remove it.


//** Sorry for the complex LINQ expression (not really) **

//Match each regex to the input string(Step 1)
var matches = Regexes.SelectMany(a => Regex.Matches(input, a.Value)
//Cast each match because MatchCollection does not implement IEnumerable<T>
.Cast<Match>()
//Select a new token for each match(Step 2)
.Select(b => 
        new
        {
            Index = b.Index,
            Value = b.Value,
            Type = a.Key //Type is based on the current regex.
        }))
//Order each token by the index (Step 3)
.OrderBy(a => a.Index).ToList();

//Loop through the tokens(Step 4)
for (int i = 0; i < matches.Count; i++)
{
    //Compare the current token with the next token to see if it is contained
    if (i + 1 < matches.Count)
    {
        int firstEndPos = (matches[i].Index + matches[i].Value.Length);
        if (firstEndPos > matches[(i + 1)].Index)
        {
            //Remove the next token from the list and stay at
            //the current match
            matches.RemoveAt(i + 1);
            i--;
        }
    }
}

//Now matches contains all of the right matches
//Filter the matches by the Type to single out keywords from comments and
//string literals.
foreach(var match in matches)
{
    Console.WriteLine(match);
}
Console.ReadLine();

That is a valid(I tested it) almost-complete lexer.(feel free to use it or write your own) It will find all of the keywords that you define in the regex and not confuse them with string literals or comments.

You can do better than trial-and-error by regex. You can make an implicit or explicit state machine in code, or use a single regex with a bunch of options in named capture groups (perhaps pieced together by code rather than written by hand). — siride, Jul 12 '13 at 03:48
Yes, I know. However, I figured the complexity those would add to the sample would increase the learning curve. Therefore it would be better learning material without the optimizations. — AtinSkrita, Jul 12 '13 at 14:58
Your answer is really good as learing material. I've created assambly that highlight source code (to add language/syntax u need only to implement proper interface). Ithink that it's a good base to create efficient solution. — , Jul 14 '13 at 21:22

Match source code keywords

2 Answers2

Description

Example