2

I can't figure out how to get a C# regex IsMatch to match a <keyword> followed by an end of line or whitespace.

I currently have [\s]+keyword[\s]+ which works for spaces, but does not work for keyword<end of string> or <start of string>keyword.

I have tried [\s^]+keyword[\s$]+, but this makes it fail to match with the spaces, and doesn't work at the end or start of a string.

Here's the code I tried:

string pattern = string.Format("[\\s^]+{0}[\\s$]+",keyword);
if(Regex.IsMatch(Text, pattern, RegexOptions.IgnoreCase))
slhck
f1wade

4 Answers

9

The problem is that ^ and $ inside character classes are not treated as anchors but as literal characters. You could simply use alternation instead of a character class:

string pattern = string.Format(@"(?:\s|^){0}(?:\s|$)",keyword);

Note that there is no need for the +, because you just want to make sure that there is one whitespace character; you don't care whether there are more of them. The ?: is just good practice: it suppresses capturing, which you don't need here. And the @ makes the string a verbatim string, where you don't have to double-escape your backslashes.
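To see the difference between the character class and the alternation, here is a minimal sketch. The class name and the sample inputs are made up for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class AlternationDemo
{
    static void Main()
    {
        string keyword = "keyword"; // hypothetical sample value

        // Inside [...] the ^ and $ are literal characters, so this pattern
        // wants a literal '^' or whitespace before the keyword.
        string classPattern = string.Format(@"[\s^]+{0}[\s$]+", keyword);
        // With alternation, ^ and $ are real anchors.
        string altPattern = string.Format(@"(?:\s|^){0}(?:\s|$)", keyword);

        // Hypothetical inputs: keyword at the start, middle, end,
        // embedded in a longer word, and wrapped in literal ^ and $.
        string[] inputs =
        {
            "keyword first", "a keyword here", "ends with keyword",
            "nokeywordhere", "^keyword$"
        };

        foreach (string input in inputs)
        {
            Console.WriteLine("{0,-20} class: {1,-6} alternation: {2}",
                input,
                Regex.IsMatch(input, classPattern, RegexOptions.IgnoreCase),
                Regex.IsMatch(input, altPattern, RegexOptions.IgnoreCase));
        }
    }
}
```

The alternation pattern should accept the first three inputs and reject the last two, while the character-class pattern accepts the literal "^keyword$".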

There is another way, which I find slightly neater. You can use negative lookarounds to ensure that there is no non-whitespace character to the left or right of your keyword (yes, double negation, think about it). This condition holds if there is a whitespace character or an end of the string on that side:

string pattern = string.Format(@"(?<!\S){0}(?!\S)",keyword);

This does exactly the same, but might be slightly more efficient (you'd have to profile that to be certain, though - if it even matters).
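One visible difference between the two patterns is what ends up in the match itself. The following sketch (the class name and input are illustrative assumptions) compares the matched text:

```csharp
using System;
using System.Text.RegularExpressions;

class LookaroundDemo
{
    static void Main()
    {
        string keyword = "keyword";      // hypothetical sample value
        string input = " keyword ";      // hypothetical input with one space on each side

        string consuming = string.Format(@"(?:\s|^){0}(?:\s|$)", keyword);
        string lookaround = string.Format(@"(?<!\S){0}(?!\S)", keyword);

        // The alternation consumes the surrounding whitespace,
        // so the spaces are part of the matched text.
        Console.WriteLine("[{0}]", Regex.Match(input, consuming).Value);

        // The lookarounds only inspect the neighbours,
        // so the match is the keyword alone.
        Console.WriteLine("[{0}]", Regex.Match(input, lookaround).Value);
    }
}
```

For IsMatch this does not matter, but it does if you later use Match, Matches, or Replace with the same pattern.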

You can also use the first pattern (with non-inverted logic) with (positive) lookarounds:

string pattern = string.Format(@"(?<=\s|^){0}(?=\s|$)",keyword);

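However, this behaves the same as the first pattern for IsMatch. It only makes a difference if you want to find multiple matches in a string, because lookarounds do not consume the surrounding whitespace, so two keywords separated by a single space can both match.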
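A small sketch of the multiple-match case, using a made-up input in which the keyword appears twice with a single space in between:

```csharp
using System;
using System.Text.RegularExpressions;

class MultipleMatchesDemo
{
    static void Main()
    {
        string keyword = "keyword";          // hypothetical sample value
        string input = "keyword keyword";    // hypothetical input

        string consuming = string.Format(@"(?:\s|^){0}(?:\s|$)", keyword);
        string lookaround = string.Format(@"(?<=\s|^){0}(?=\s|$)", keyword);

        // The first match of the consuming pattern swallows the space,
        // so nothing is left to satisfy (?:\s|^) before the second keyword.
        Console.WriteLine(Regex.Matches(input, consuming).Count);  // 1

        // The lookarounds leave the space in place, so both keywords match.
        Console.WriteLine(Regex.Matches(input, lookaround).Count); // 2
    }
}
```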

By the way, if your keyword might contain regex meta-characters (like |, $, + and so on), make sure to escape it first using Regex.Escape.
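As a rough illustration of why escaping matters, the sketch below uses a hypothetical keyword containing a dot. The helper name ContainsKeyword and the inputs are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

class EscapeDemo
{
    // Hypothetical helper: builds the whitespace-delimited pattern
    // around an already prepared keyword fragment.
    static bool ContainsKeyword(string text, string keywordFragment)
    {
        string pattern = string.Format(@"(?<!\S){0}(?!\S)", keywordFragment);
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        string keyword = "a.b";               // hypothetical keyword with a meta-character
        string escaped = Regex.Escape(keyword); // yields a\.b

        // Unescaped, the dot matches any character, so "axb" is accepted too.
        Console.WriteLine(ContainsKeyword("see axb here", keyword));  // True
        // Escaped, only a literal dot is accepted.
        Console.WriteLine(ContainsKeyword("see axb here", escaped));  // False
        Console.WriteLine(ContainsKeyword("see a.b here", escaped));  // True
    }
}
```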

Patrik Svensson
Martin Ender
  • would your look ahead / behind match if there was non white space characters ahead and behind, and then if the keyword was not there. – f1wade Apr 25 '13 at 11:32
  • @f1wade I'm not sure what you mean. It simply fulfills your specification - it matches `keyword` if it is immediately surrounded by spaces or the end of the string. It will also work with longer strings, where `keyword` is just one of the words. – Martin Ender Apr 25 '13 at 11:34
  • as you used \S instead of \s the uppercase version matches non whitespace i think? – f1wade Apr 25 '13 at 11:40
  • 1
    @f1wade oh I see what you mean now. I had a typo there. The first lookaround snippet was supposed to use negative lookarounds. If their contents match they cause the pattern to fail. This is what I meant by double negation. – Martin Ender Apr 25 '13 at 11:41
  • thanks for running that through with me, my code now looks like this string pattern = string.Format(@"(?<=\s|^){0}(?=\s|$)", Regex.Escape(keyword)); if(Regex.IsMatch(UserText, pattern, RegexOptions.IgnoreCase)) – f1wade Apr 25 '13 at 11:50
1

I am not exactly sure what you are really trying to accomplish with this regex, but the following code will match the string 'keyword' when it stands as a whole word, for example with white space on either side of it:

string resultString = null;
try {
    Regex regexObj = new Regex(@"\b(keyword)\b");
    resultString = regexObj.Match(subjectString).Value;
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

It can be generally explained as: each \b asserts a position at a word boundary, here at the beginning and at the end of the word. In this case I assumed the word of interest was keyword.
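Note that a word boundary is not the same thing as white space. The sketch below compares \b with a whitespace-only check on a punctuation-delimited input; the class name is made up, and the second pattern is the negative-lookaround variant rather than part of this answer.

```csharp
using System;
using System.Text.RegularExpressions;

class WordBoundaryDemo
{
    static void Main()
    {
        // The keyword is surrounded by punctuation, not by white space.
        string input = "some string:keyword.";

        // \b matches between a word character and a non-word character,
        // so ':' and '.' both count as boundaries.
        Console.WriteLine(Regex.IsMatch(input, @"\bkeyword\b"));         // True

        // The negative lookarounds reject any non-whitespace neighbour.
        Console.WriteLine(Regex.IsMatch(input, @"(?<!\S)keyword(?!\S)")); // False
    }
}
```

Which behaviour is wanted depends on whether punctuation next to the keyword should count as a delimiter.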

I also thought from my interpretation of your question that you might be interested in matching the entire series of characters that follow the keyword up to the line break. If that is the case the following regex code will return that match:

string resultString = null;
try {
    Regex regexObj = new Regex(@"\bkeyword\b(\w*\s*)$");
    resultString = regexObj.Match(subjectString).Value;
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

This regular expression can be read as follows: the \b on either side asserts the beginning and ending word boundaries of keyword. The (\w*\s*)$ part matches any run of word characters (\w*) followed by any run of whitespace characters (\s*), and then asserts the position at the end of the string with $ (or at the end of the line, if RegexOptions.Multiline is set).

This next bit of code matches the entire line of data that contains the keyword; lines of data that do not contain the keyword will not match.

string resultString = null;
try {
    Regex regexObj = new Regex("^.*keyword.*$");
    resultString = regexObj.Match(subjectString).Value;
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

Explained: the ^ asserts the position at the beginning of the string, the .* matches any run of characters that are not line breaks, then comes the keyword, followed by another .* so that the remaining non-line-break characters are included, and the $ asserts the position at the end of the string, which would be the entire line in this example. For a subject string with several lines, pass RegexOptions.Multiline so that ^ and $ match at each line.
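For a subject that spans several lines, the anchors behave differently depending on RegexOptions.Multiline. A minimal sketch with a made-up three-line input (using \n line endings):

```csharp
using System;
using System.Text.RegularExpressions;

class MultilineDemo
{
    static void Main()
    {
        // Hypothetical subject string with the keyword on the second line.
        string subjectString = "first line\nhas keyword here\nlast line";

        // Without Multiline, ^ only matches at the very start of the string
        // and .* cannot cross a line break, so there is no match.
        Match plain = Regex.Match(subjectString, "^.*keyword.*$");
        Console.WriteLine(plain.Success);                 // False

        // With Multiline, ^ and $ match at the start and end of each line.
        Match perLine = Regex.Match(subjectString, "^.*keyword.*$", RegexOptions.Multiline);
        Console.WriteLine("[{0}]", perLine.Value);        // [has keyword here]
    }
}
```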

I hope the above is helpful, if not this time then maybe in the future. I am always trying to discover alternative practices to achieve the same results, so if you have any constructive criticism, please post it.

Best wishes, Steve

Steve Kinzey
0

Try this:

string pattern = string.Format("^\\s*{0}\\s*$",keyword);
Victor Mukherjee
  • 1
    i dont think that would allow other words between start of line and the keyword. likewise the end of line – f1wade Apr 25 '13 at 11:25
0

I found this other post: How to specify "Space or end of string" and "space or start of string"?

That answered the question, so my code is now:

string pattern = string.Format("\\b+{0}\\b+",keyword);
if(Regex.IsMatch(UserText, pattern, RegexOptions.IgnoreCase))
Community
f1wade
  • 2
    You should note that the `+` is completely unnecessary. `\b` does not match a character, but a position, so it doesn't advance the engine's "cursor". `\b` is therefore exactly the same as `\b\b\b`. Also, this will match your keyword if it occurs like `some string:keyword.`, because `\b` matches between word characters (`[a-zA-Z0-9_]`... in .NET possibly some more Unicode characters) and non-word characters. If you really want to restrict it to spaces, have a look at my answer. – Martin Ender Apr 25 '13 at 11:30
  • so \b matches any non char i.e. not[a-zA-Z0-9_]? – f1wade Apr 25 '13 at 11:38
  • 1
    My point is, [it doesn't match any character at all, it matches a position.](http://www.regular-expressions.info/wordboundaries.html). For example if your input string is `a-c` (where `-` is a non-word character), then the pattern `\ba\b` will match **only** the `a`. The `-` is not part of the match because `\b` just checks a position for the two adjacent characters, without actually including them in the match. That also means that in this example `a\b-` (which is a slightly pointless pattern) would give you a match. – Martin Ender Apr 25 '13 at 11:44