Simple: use a lexer. A lexer finds groups of text in a string and generates a token from each group. Each token is then given a "type" (a label that says what it is).
Here the type we care about is "keyword": any one of the reserved C# keywords.
A simple regular expression for this puts word boundaries around an alternation of the possible C# keywords ("\b(new|var|string|...)\b").
Your lexer will find all of the keyword matches in a given string, create a token for each match, and set the token's "type" to "keyword".
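As a rough illustration of that first pass, here is a minimal sketch that runs only the keyword regex over a made-up input. The `sample` string and the tuple shape of the tokens are assumptions for this example, not part of the lexer shown later.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input: one keyword in code, one inside a string literal,
// and one inside a comment.
string sample = "var x = \"new\"; // string";

// Find every keyword match and turn it into a (Index, Value, Type) token.
var keywordTokens = Regex.Matches(sample, @"\b(?:new|var|string)\b")
    .Cast<Match>()
    .Select(m => (Index: m.Index, Value: m.Value, Type: "Keyword"))
    .ToList();

foreach (var token in keywordTokens)
{
    Console.WriteLine($"{token.Type} '{token.Value}' at index {token.Index}");
}
```

All three words are reported as keywords here, including the ones inside the string literal and the comment, because the keyword regex alone knows nothing about its surroundings.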
However, as you say, you do not want to find keywords inside quotes or comments.
This is where the lexer really earns its points.
To resolve this case, a (regex-based) lexer uses two rules:
- Remove every match that is contained by another match.
- Remove a match that occupies the same space as another but has a lower priority.
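The two rules can be sketched as follows. This is only one possible way to apply them, under some assumptions: the candidate list is written by hand, the numeric `Priority` field (lower value wins) and the "Identifier" type are hypothetical, and overlaps are resolved by sorting on index and then priority.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical candidates for the text:  "new string" var
// Each entry is (Index, Length, Type, Priority); lower Priority wins a tie.
var candidates = new List<(int Index, int Length, string Type, int Priority)>
{
    (0, 12, "String", 0),      // the whole quoted literal
    (1, 3, "Keyword", 1),      // new, contained by the literal
    (5, 6, "Keyword", 1),      // string, contained by the literal
    (13, 3, "Keyword", 1),     // var, outside the literal
    (13, 3, "Identifier", 2),  // same space as var, lower priority
};

var kept = new List<(int Index, int Length, string Type, int Priority)>();
int lastEnd = 0; // end position of the most recently kept match

foreach (var c in candidates.OrderBy(c => c.Index).ThenBy(c => c.Priority))
{
    // Rule 1 and rule 2: skip anything that starts before the last kept
    // match ends (contained by it, or sharing its space with lower priority).
    if (c.Index < lastEnd) continue;

    kept.Add(c);
    lastEnd = c.Index + c.Length;
}

foreach (var k in kept)
{
    Console.WriteLine($"{k.Type} at index {k.Index}, length {k.Length}");
}
```

Sorting by priority within the same index means the higher-priority match is seen first, so the lower-priority one is dropped by the same overlap check that handles containment.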
A lexer works in the following steps:
- Find all of the matches from the regexes.
- Convert them to tokens.
- Order the tokens by index.
- Loop through the tokens, comparing the current match with the next match; if the next match is partially contained by the current match (or if they both occupy the same space), remove it.
Spoiler Alert
Below is a small working lexer that demonstrates the steps above.
For Example:
Given regexes for strings, comments, and keywords, show how a lexer resolves conflicts between them.
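using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

//Simple Regex for strings
string StringRegex = "\"(?:[^\"\\\\]|\\\\.)*\"";
//Simple Regex for comments (lazy *? so a block comment stops at the first */)
string CommentRegex = @"//.*|/\*[\s\S]*?\*/";
//Simple Regex for keywords
string KeywordRegex = @"\b(?:new|var|string)\b";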
//Create a dictionary relating token types to regexes
Dictionary<string, string> Regexes = new Dictionary<string, string>()
{
    {"String", StringRegex},
    {"Comment", CommentRegex},
    {"Keyword", KeywordRegex}
};
//Define a string to tokenize
string input = "string myString = \"Hi! this is my new string!\"//Defines a new string.";
//Lexer steps:
//1). Find all of the matches from the regexes
//2). Convert them to tokens
//3). Order the tokens by index (no priority tie-break is needed here,
//    because these three regexes can never match at the same index)
//4). Loop through each of the tokens comparing
//    the current match with the next match,
//    if the next match is partially contained by this match
//    (or if they both occupy the same space) remove it.
//** Sorry for the complex LINQ expression (not really) **
//Match each regex to the input string(Step 1)
var matches = Regexes.SelectMany(a => Regex.Matches(input, a.Value)
    //Cast each match because MatchCollection does not implement IEnumerable<T>
    .Cast<Match>()
    //Select a new token for each match(Step 2)
    .Select(b =>
        new
        {
            Index = b.Index,
            Value = b.Value,
            Type = a.Key //Type is based on the current regex.
        }))
    //Order each token by the index (Step 3)
    .OrderBy(a => a.Index).ToList();
//Loop through the tokens(Step 4)
for (int i = 0; i < matches.Count; i++)
{
    //Compare the current token with the next token to see if it is contained
    if (i + 1 < matches.Count)
    {
        int firstEndPos = (matches[i].Index + matches[i].Value.Length);
        if (firstEndPos > matches[(i + 1)].Index)
        {
            //Remove the next token from the list and stay at
            //the current match
            matches.RemoveAt(i + 1);
            i--;
        }
    }
}
//Now matches contains all of the right matches
//Filter the matches by the Type to single out keywords from comments and
//string literals.
foreach (var match in matches)
{
    Console.WriteLine(match);
}
Console.ReadLine();
That is a valid (I tested it), almost-complete lexer (feel free to use it or write your own). It will find all of the keywords that you define in the regex and not confuse them with string literals or comments.
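To single out just the keywords, as the last comment in the code suggests, the resolved tokens can be filtered by their type. A minimal sketch, assuming a hand-written `tokens` list (hypothetical, using named tuples) that stands in for the `matches` list above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical resolved tokens, written by hand to stand in for `matches`.
var tokens = new List<(int Index, string Value, string Type)>
{
    (0, "string", "Keyword"),
    (18, "\"Hi! this is my new string!\"", "String"),
    (46, "//Defines a new string.", "Comment"),
};

// Keep only the tokens whose type is "Keyword".
var keywords = tokens.Where(t => t.Type == "Keyword").ToList();

foreach (var k in keywords)
{
    Console.WriteLine($"{k.Type} '{k.Value}' at index {k.Index}");
}
```

Only the leading `string` survives the filter; the occurrences of `new` and `string` inside the literal and the comment were already removed during conflict resolution.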