17

I need to create a regex that can match multiple strings. For example, I want to find all the instances of "good" or "great". I found some examples, but what I came up with doesn't seem to work:

\b(good|great)\w*\b

Can anyone point me in the right direction?

Edit: I should note that I don't want to just match whole words. For example, I may want to match "ood" or "reat" as well (parts of the words).

Edit 2: Here is some sample text: "This is a really great story." I might want to match "this" or "really", or I might want to match "eall" or "reat".

Jon Tackabury
  • 1
    Do you want to match "oo", "o" or "t", too? – jpalecek Mar 30 '09 at 19:09
  • 4
    What about ooooooooooooooooooooooooooooooooooooooooooooooooooo? – C. Ross Mar 30 '09 at 19:16
  • I found that using "good|great" as the pattern works, is this ok? Why do some people's examples have more markup in them? – Jon Tackabury Mar 30 '09 at 19:16
  • yes, you might also want to translate it or send an e-mail with it. – Diadistis Mar 30 '09 at 19:17
  • I can't see any similarity between your examples. Are you trying to match random substrings within random words? Regular expressions are for matching patterns -- so you'll need to tell us WHY "this", "really", "eall" and "reat" are the correct matches. – ojrac Mar 30 '09 at 19:26
  • I'm trying to find text in a document that matches a list of words. I want to generate a regex pattern from a list of words, then use that to see if there are any of those words in that document. – Jon Tackabury Mar 30 '09 at 19:28
  • Agreed with the others, your information about partial words currently makes no sense. – Chad Birch Mar 30 '09 at 19:28
  • For the partial words, if I'm trying to find "house", I would want to match "houses" as well. So I would use "house" in my regex pattern and match the partial word. – Jon Tackabury Mar 30 '09 at 19:30
  • So what do you want to match? "oo", "eall" or "plusgoodwise"? If the latter, just join the words with |. – jpalecek Mar 30 '09 at 19:32
  • This question would be more valuable to the site if it was made more precise. E.g. why do you say "may" or "might"? Under what conditions do they hold? Also, as far as I do understand the question, the accepted answer is not correct! "Good" in your word list won't produce a match on "ood". – bmm6o Nov 06 '13 at 23:57

6 Answers

25

If you can guarantee that there are no reserved regex characters in your word list (or if you escape them), you can use the code below to turn a big word list into @"(a|big|word|list)". There's nothing wrong with the | operator as you're using it, as long as the () surround it. It sounds like the \w* and \b patterns are what is interfering with your matches.

string[] pattern_list = { "good", "great" };  // example word list
string regex = String.Format("({0})", String.Join("|", pattern_list));  // "(good|great)"
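
A minimal sketch of the same idea in Python (the regex syntax involved is the same as in .NET), assuming the word list may contain regex metacharacters, so each word is escaped before joining. The names `build_pattern`, `words` and the sample text are made up for illustration. Note that, like the C# version, this finds the listed words inside longer words ("good" in "goodness") but not fragments of them such as "ood".

```python
import re

def build_pattern(words):
    """Join a word list into one alternation, escaping metacharacters."""
    # Longest words first, so a longer word is tried before its own prefix.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile("(" + "|".join(re.escape(w) for w in ordered) + ")")

# Hypothetical word list; "c++" would break an unescaped pattern.
pattern = build_pattern(["good", "great", "c++"])
text = "A really great story about goodness and c++."

matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)  # ['great', 'good', 'c++']
```

Compiling the pattern once and reusing it keeps the cost down when the list grows long.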
ojrac
  • 1
    Possible one mistake: It should be String.Join("|", word_list) rather than String.Join(word_list, "|"), see also http://msdn.microsoft.com/en-us/library/57a79xd0.aspx – David Mar 05 '13 at 09:41
  • 1
    Contrary to the question, it won't match for example the `"ood"` in `"good"`. – MikeM Mar 19 '13 at 18:44
  • @MikeM I don't know why I never noticed that. Removed a bunch of off-topic stuff and focused on the \w and \b. – ojrac Mar 26 '13 at 01:30
  • @ojrac you may want to expand on "escape them" remark (probably just link to some other question). Additionally giving option for word matches may be good idea - something like "if you need to do whole word matches wrap resulting expression with `\b({0})\b`" (the post is easy to find by "c# regex match multiple strings" search and it is not immediately clear if question is scoped to word matches/partial matches). – Alexei Levenkov Sep 08 '15 at 21:53
  • upvoted what if I have say 3000 words to match, does this method still hold? – PirateApp Dec 10 '19 at 11:39
  • My hunch is yes, though you should try to make sure you aren't compiling your regex every time you use it if it's getting to be thousands of characters long. At some point this will break down, though -- at that point, you might need to build a dictionary out of a trie. You can still use regex to split the text into words. But, as always: try the simple thing and profile it first. If you're curious about ways regex engines break down, you might enjoy this read: https://swtch.com/~rsc/regexp/regexp1.html – ojrac Dec 11 '19 at 14:33
4
(good)*(great)*

after your edit:

\b(g*o*o*d*)*(g*r*e*a*t*)*\b
Chris Ballance
2

I think you are asking for something you don't really mean. If you want to search for any part of a word, you are literally searching for letters.

E.g. searching for {Jack, Jim} in "John and Shelly are cool"

means searching for all the letters in the names, {J,a,c,k,i,m}:

*J*ohn *a*nd Shelly *a*re

and for that you don't need a regex :)

In my opinion, a suffix tree can help you with that:

http://en.wikipedia.org/wiki/Suffix_tree#Functionality

Enjoy.

Tomer W
1

Just check the boolean that Regex.IsMatch() returns.

if (Regex.IsMatch(line, "condition") && Regex.IsMatch(line, "condition2"))

This is true only when the line matches both regexes.
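
To make the difference between requiring every pattern and requiring any one of them concrete, here is a small Python sketch in which `re.search` plays the role of `Regex.IsMatch`. The names `words`, `line` and `patterns` are illustrative only. `all` corresponds to chaining the checks with `&&`, `any` to chaining them with `||`.

```python
import re

words = ["good", "great"]               # hypothetical word list
line = "This is a really great story."  # sample text from the question

# Compile once and reuse; the words are escaped so they match literally.
patterns = [re.compile(re.escape(w)) for w in words]

has_all = all(p.search(line) for p in patterns)  # like chaining with &&
has_any = any(p.search(line) for p in patterns)  # like chaining with ||

print(has_all, has_any)  # False True
```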

Alan Moore
  • The list may have a lot more than two words in it, and this approach doesn't scale well. Also, I think you only need to match one of the words, meaning your `&&` should be `||`. The answer itself had many problems with formatting, syntax and spelling, which I attempted to correct. Please review my changes. – Alan Moore Apr 01 '13 at 17:21
1

I'm not sure I understand the problem correctly:

If you want to match "great" or "reat" you can express this by a pattern like:

"g?reat"

This simply says that the "reat"-part must exist and the "g" is optional.

This would match "reat" and "great" but not "eat", because the first "r" in "reat" is required.

If you have the two words "great" and "good" and want to match both with an optional "g", you can write:

(g?reat|g?ood)

And if you want to include a word boundary:

\b(g?reat|g?ood)

You should be aware that this would not match anything like "breat" because you have the "reat" but the "r" is not at the word boundary because of the "b".
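
The behaviour described above can be checked with a short Python sketch (for these inputs the patterns behave the same way as in .NET). The variable names and the sample string are made up for illustration.

```python
import re

optional_g = re.compile(r"(g?reat|g?ood)")
anchored = re.compile(r"\b(g?reat|g?ood)")

sample = "great reat eat good ood breat"

# Without the boundary, the pattern matches anywhere inside a word,
# so the "reat" inside "breat" is found as well.
unanchored_hits = optional_g.findall(sample)
print(unanchored_hits)  # ['great', 'reat', 'good', 'ood', 'reat']

# With a leading \b, "reat" inside "breat" is no longer matched,
# because the "r" does not sit at a word boundary.
anchored_hits = anchored.findall(sample)
print(anchored_hits)  # ['great', 'reat', 'good', 'ood']
```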

So if you want to match whole words that contain a substring like "reat" or "ood", then you should try:

"\b\w*?(reat|ood)\w*\b"

This reads:
1. Beginning at a word boundary, match any number of word characters, but don't be greedy.
2. Match "reat" or "ood"; this ensures that only words containing one of them are matched.
3. Match any number of word characters following "reat" or "ood" until the next word boundary is reached.

This will match:

"goodness", "good", "ood" (if a complete word)

It can be read as: Give me all complete words that contain "ood" or "reat".

Is that what you are looking for?
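
A small Python sketch of the "complete words that contain a substring" idea. It assumes the trailing part is optional (`\w*`), so that words ending in the substring, such as "good", are matched too. The sample sentence and variable names are made up for illustration.

```python
import re

# Non-capturing group, so findall returns the whole word
# rather than just the "reat"/"ood" part.
whole_word = re.compile(r"\b\w*?(?:reat|ood)\w*\b")

text = "Goodness, what a great mood: good food and greatly treated guests."
found = whole_word.findall(text)
print(found)
# ['Goodness', 'great', 'mood', 'good', 'food', 'greatly', 'treated']
```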

1

I'm not entirely sure that regex alone offers a solution for what you're trying to do. You could, however, use the following code to build a regular expression from every substring of a given word (the function calls them permutations). Be aware that the resulting pattern has the potential to become very long and slow:

function wordPermutations( $word, $minLength = 2 )
{
    $perms = array( );

    for ($start = 0; $start < strlen( $word ); $start++)
    {
        for ($end = strlen( $word ); $end > $start; $end--)
        {
            $perm = substr( $word, $start, ($end - $start));

            if (strlen( $perm ) >= $minLength)
            {
                $perms[] = $perm;
            }
        }
    }

    return $perms;
}

Test Code:

$perms = wordPermutations( 'great', 3 );  // get all substrings of "great" that are 3 or more chars in length
var_dump( $perms );

echo ( '/\b('.implode( '|', $perms ).')\b/' );

Example Output:

array
  0 => string 'great' (length=5)
  1 => string 'grea' (length=4)
  2 => string 'gre' (length=3)
  3 => string 'reat' (length=4)
  4 => string 'rea' (length=3)
  5 => string 'eat' (length=3)

/\b(great|grea|gre|reat|rea|eat)\b/
KOGI