2

I'm creating a blacklist of keywords that I want to check for in text files. However, I'm having trouble finding any regex documentation that helps with the following issue.

I have a set of blacklisted keywords:

welcome, goodbye, join us

I want to check some text files for any matches. I'm using the following regex to match exact words and also the pluralized version.

string.Format(@"\b{0}s*\b", keyword)

However, I've run into an issue matching keywords that consist of two words with any character in between. The above regex matches 'join us', but I also need to match 'join@us' or 'join_us', for example.

Any help would be greatly appreciated.

Thomas Ayoub
CinnamonBun

3 Answers

5

I think that the "any character in between" requirement may cause you a lot of trouble. For example, consider this:

We want to find "my elf"... but you probably don't want to match "myself".

Anyway, if this is OK with you, replace the space character in the keyword with a dot using string.Replace.

A . in a regex matches any single character (except a newline, by default).
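A minimal sketch of that replacement, mainly to show how permissive the dot is. The class and method names (`DotSeparatorSketch`, `DotPattern`) are hypothetical, and escaping each word with `Regex.Escape` is an added precaution that the answer does not mention.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class DotSeparatorSketch
{
    // Hypothetical helper: builds a pattern in which each space of the
    // keyword becomes ".", so any single character may separate the words.
    public static string DotPattern(string keyword)
    {
        // Escape each word so regex metacharacters in a keyword stay literal.
        var words = keyword.Split(' ').Select(Regex.Escape);
        return @"\b" + string.Join(".", words) + @"\b";
    }
}
```

With this pattern, `Regex.IsMatch("join@us", DotSeparatorSketch.DotPattern("join us"))` succeeds, but so does matching "myself" against the pattern built from "my elf", which is exactly the problem described above.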

If you are new to regexes, check this useful cheat sheet: http://www.mikesdotnetting.com/article/46/c-regular-expressions-cheat-sheet

To solve the issue with "myself" and "my elf", use something more careful than . in the regex. For example, [^a-zA-Z] matches anything except the letters a to z and A to Z. Alternatively, \W matches any non-word character, which means anything except a-zA-Z0-9_, so for ASCII text it is equivalent to [^a-zA-Z0-9_]. Note that \W therefore does not match the underscore in 'join_us'; use [\W_] if the underscore should count as a separator too.
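As a sketch of the more careful separator, the following assumes a hypothetical helper `SeparatorSketch.Build` that joins the words of a keyword with a configurable separator pattern. The default `[\W_]` is an assumption chosen so that the underscore also counts as a separator, and `s?` (rather than the `s*` in the question) accepts at most one trailing "s".

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class SeparatorSketch
{
    // Hypothetical helper: joins the words of a keyword with a separator
    // pattern and allows one optional trailing "s" for simple plurals.
    public static Regex Build(string keyword, string separator = @"[\W_]")
    {
        // Escape each word so regex metacharacters in a keyword stay literal.
        var words = keyword.Split(' ').Select(Regex.Escape);
        string pattern = @"\b" + string.Join(separator, words) + @"s?\b";
        return new Regex(pattern, RegexOptions.IgnoreCase);
    }
}
```

`SeparatorSketch.Build("join us")` matches "join us", "join@us" and "join_us", while `SeparatorSketch.Build("my elf")` no longer matches "myself". Passing `@"\W"` as the separator reproduces the stricter behaviour in which "join_us" is not matched.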

Also be careful about plural forms like city - cities and all the irregular ones.

Matyas
0

If you're set on using pluralization, you will have to use the PluralizationService (see this answer for more details).

And seeing that you're using string.Format, I assume you're looping over your blacklist array.

So why not do it all in a neat method?

// Requires a reference to the System.Data.Entity.Design assembly.
using System.Collections.Generic;
using System.Data.Entity.Design.PluralizationServices;
using System.Globalization;

public static string GetBlacklistRegexString(string[] blacklist)
{
    //It seems that this service only supports English natively, to check later
    var ps = PluralizationService.CreateService(CultureInfo.GetCultureInfo("en"));

    //We're just going to make a unique regex with all the words
    //and their plurals in a list, so we're looping here
    var alternatives = new List<string>();
    foreach (var word in blacklist)
    {
        //Using a dot wasn't careful indeed... Feel free to replace
        //"\W" with anything that does it for you. It matches any
        //non-word character, so not the underscore (use "[\W_]" for that)
        alternatives.Add(word.Replace(" ", @"\W"));
        alternatives.Add(ps.Pluralize(word).Replace(" ", @"\W"));
    }

    //Grouping the alternatives so the word boundaries apply to each of them;
    //string.Join also avoids having to remove a trailing '|'
    return @"\b(" + string.Join("|", alternatives) + @")\b";
}

The usage is nothing surprising if you're already using regular expressions in .NET:

static void Main(string[] args)
{
    string[] blacklist = {"Goodbye","Welcome","join us"};
    string input = "Welcome, come join us at dummywebsite.com for fun and games, goodbye!";

    //I assume that you want it case insensitive
    Regex blacklistRegex = new Regex(GetBlacklistRegexString(blacklist), RegexOptions.IgnoreCase);

    foreach (Match match in blacklistRegex.Matches(input))
    {
        Console.WriteLine(match);
    }

    Console.ReadLine();
}

The console shows the expected output:

  • Welcome
  • join us
  • goodbye

Edit: still have a problem (working on it later), if "man" is in your keywords, it will match the "men" in "women"... Weirdly I don't get this behaviour on regexhero.

Edit 2: duh, of course if I don't group the words with parentheses, the word boundaries are only applied to the first and last one... Corrected.
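The effect described in the two edits can be reproduced with a small sketch. The class name `GroupingSketch` and its members are hypothetical, and the two patterns are reduced to the "man"/"men" case mentioned above.

```csharp
using System.Text.RegularExpressions;

public static class GroupingSketch
{
    // Without a group, "|" splits the whole pattern: \bman OR men\b.
    public const string Ungrouped = @"\bman|men\b";

    // With a group, both word boundaries apply to every alternative.
    public const string Grouped = @"\b(man|men)\b";

    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern, RegexOptions.IgnoreCase);
    }
}
```

Against "women", the ungrouped pattern matches because its second alternative, `men\b`, has no leading word boundary; the grouped pattern does not match, while "two men" still does.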

Community
Kilazur
0

You could try something like this (I left only the {0} part of the regex):

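// Add anything you like here; the space is included so that "join us" itself still matches.
// Characters that are regex metacharacters (such as '.' or '|') would need escaping.
var relevantChars = new char[] { ' ', ',', '@' };
string pattern = string.Format(@"{0}", keyword.Replace(" ", "(" + string.Join("|", relevantChars) + ")"));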
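If the separator list may contain regex metacharacters, each one can be escaped before it is joined. The sketch below assumes a hypothetical helper `SeparatorListSketch.Build` and uses a non-capturing group, neither of which is part of the snippet above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SeparatorListSketch
{
    // Hypothetical helper: replaces each space in the keyword with an
    // alternation of the allowed separator characters.
    public static string Build(string keyword, IEnumerable<char> separators)
    {
        // Escape every separator so characters such as '.' or '|' stay literal.
        string alternation = "(?:" +
            string.Join("|", separators.Select(c => Regex.Escape(c.ToString()))) + ")";

        // Escape the words as well, then put the alternation between them.
        var words = keyword.Split(' ').Select(Regex.Escape);
        return string.Join(alternation, words);
    }
}
```

For example, `SeparatorListSketch.Build("join us", new[] { ' ', '@', '_', '.' })` yields a pattern that matches "join us", "join@us" and "join.us" but not "joinXus". The result can be dropped into the {0} slot of the original format string.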
Maor Veitsman