82

As the title says, I need to find two specific words in a sentence. They can appear in any order and in any casing. How do I go about doing this using regex?

For example, I need to extract the words test and long from the following sentence, regardless of which of the two comes first.

This is a very long sentence used as a test

UPDATE: What I did not mention in the first part is that it needs to be case insensitive as well.

Will Lanni
RC1140
  • Do you care about multiple occurrences of the words? Do you know what words you want to extract, or are you wanting to match words that fit a particular pattern? Do you want to find out what position they're at? – Dominic Rodger Jul 24 '09 at 11:34
  • I know the exact words, don't care about multiples, and don't need the position. It does need to be case insensitive. – RC1140 Jul 24 '09 at 12:00
  • Which regex flavor are you using? JavaScript, .NET, PHP...? And how important is performance? Are you working with very large strings, or doing a great many matches? Several viable answers have been posted already, but none of them is particularly efficient. – Alan Moore Jul 24 '09 at 22:01
  • I think the most important thing (which I found out today) is that the checks are done in .NET, so I am not sure whether all the answers below apply. I have tried them all and, sadly, .NET does not treat any of them as case insensitive. – RC1140 Jul 27 '09 at 06:51
  • Ehh, whether it's case sensitive or not should not depend on the regex. You're better off programming the software to be case insensitive. However, to recognize multiple words in any order using regex, I'd suggest using a **quantifier**: `(\b(james|jack)\b.*){2,}`. Unlike a **lookaround** or **mode modifier**, this works in most regex flavours. – XPMai Jun 01 '20 at 10:45

8 Answers

57

You can use

(?=.*test)(?=.*long)

Source: MySQL SELECT LIKE or REGEXP to match multiple words in one record
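As a minimal sketch of how the two lookaheads behave, here is the same pattern in Python's `re` module. The function name `contains_both` is made up for illustration, and case insensitivity is added through `re.IGNORECASE` rather than being part of the pattern above. Note that this only tests for the words; it does not extract them, and without `\b` it would also accept "testing".

```python
import re

# Each lookahead asserts that its word occurs somewhere at or after the
# current position without consuming text, so word order does not matter.
BOTH_WORDS = re.compile(r"(?=.*test)(?=.*long)", re.IGNORECASE)


def contains_both(sentence: str) -> bool:
    """Return True if the sentence contains both words, in any order."""
    return BOTH_WORDS.search(sentence) is not None
```

Because `.` does not match a newline by default, this sketch assumes both words are on the same line.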

Community
velop
  • This fails. https://regex101.com/r/Z8KOLp/2 – ChrisJJ Dec 07 '16 at 22:36
  • @ChrisJJ The regexp above does not return the words test or long. It matches if the string contains both test and long, in any order, and does not match otherwise. – velop Dec 10 '16 at 21:04
  • Sure, but the requirement is to find/extract the words, not simply to test for them. – ChrisJJ Dec 11 '16 at 12:35
  • Another variant: [`(?is)^(?=.*\b(test)\b)(?=.*?\b(long)\b).*`](https://regex101.com/r/CPNu4y/1), which also captures the words and matches the whole string. It is also anchored to the `^` start, which improves performance considerably. `\b` matches a *word boundary*. – bobble bubble Jul 20 '17 at 16:18
  • Is there any way to match either or both of these two words? E.g. if that regexp is used with the text "This is a very long sentence", the word long will not be found. It would be good to add an optional modifier. Is it possible? – Joniale Nov 13 '18 at 08:39
  • OK, for my previous comment I guess just \b(test)|\b(long) will be enough. – Joniale Nov 13 '18 at 08:45
  • If the answer's regex is too slow, you can add ^ to the start of it to considerably increase performance, as noted in bobble bubble's comment. – Maarten Jan 31 '23 at 14:09
  • MariaDB word boundaries are `[[:<:]]` and `[[:>:]]`. So instead of `\b(test)\b` you write `[[:<:]](test)[[:>:]]`. – Maarten Jan 31 '23 at 14:38
38

Use capturing groups if you want to extract the matches: (test)|(long). Then, depending on the language in use, you can refer to the matched groups using $1 and $2, for example.
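A rough Python sketch of this extraction approach follows. The helper names `extract_words` and `has_both` are hypothetical, and the sketch assumes you iterate over all matches (the equivalent of a global flag), since a single match of the alternation only ever finds one of the two words.

```python
import re

# Alternation with one capturing group per word, case insensitive.
WORDS = re.compile(r"(test)|(long)", re.IGNORECASE)


def extract_words(sentence: str) -> list[str]:
    """Return every occurrence of either word, in the order found."""
    return [m.group(0) for m in WORDS.finditer(sentence)]


def has_both(sentence: str) -> bool:
    """Return True only if both words were found at least once."""
    found = {word.lower() for word in extract_words(sentence)}
    return found == {"test", "long"}
```

Collecting the lowercased matches in a set is one way to turn "either word" into "both words" without caring about order or repeats.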

Eric Leschinski
Paul Lydon
  • I used this answer in conjunction with the (?i) from the answer below. This resulted in (?i)(test(long)?), because it turns out I had to test for test first and then long. Whether it is the correct way is another story, but it worked for me. – RC1140 Jul 27 '09 at 09:05
  • Given the requirement is to match test AND long, this solution needs the g flag. – ChrisJJ Dec 07 '16 at 22:04
14

I assume (always dangerous) that you want to find whole words, so "test" would match but "testy" would not. Thus the pattern must search for word boundaries, so I use the "\b" word boundary pattern.

/(?i)(\btest\b.*\blong\b|\blong\b.*\btest\b)/
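To illustrate the effect of the word boundaries, here is a small sketch using Python's `re` module, with the surrounding `/` delimiters dropped because Python does not use them. The function name is invented for the example.

```python
import re

# Same pattern as above: either order, whole words only, case insensitive.
WHOLE_WORDS = re.compile(r"(?i)(\btest\b.*\blong\b|\blong\b.*\btest\b)")


def contains_both_whole_words(sentence: str) -> bool:
    """Return True if both 'test' and 'long' appear as whole words."""
    return WHOLE_WORDS.search(sentence) is not None
```

With this pattern, "testy" no longer counts as an occurrence of "test", because there is no word boundary between "test" and "y".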
Paul Chernoch
8

Without knowing which language you are using:

/test.*long/

or

/long.*test/

or

/test/ && /long/
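The last form, two separate tests combined with a logical AND, could look roughly like this in Python. This is only a sketch; the function name is hypothetical and `re.IGNORECASE` is added to cover the case-insensitivity requirement from the question.

```python
import re


def contains_both(sentence: str) -> bool:
    """Run one simple search per word and combine the results with AND."""
    has_test = re.search(r"test", sentence, re.IGNORECASE) is not None
    has_long = re.search(r"long", sentence, re.IGNORECASE) is not None
    return has_test and has_long
```

Two simple searches avoid the backtracking that `.*` between the words can cause, at the cost of scanning the string twice.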
ghostdog74
4

Try this:

/(?i)(?:test.*long|long.*test)/

That will match either test and then long, or long and then test. It will ignore case differences.
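As a quick sketch of what this pattern actually matches, the following Python snippet returns the matched text. The function name `matched_span` is made up for the example. The match runs from the first of the two words to the last one, including everything in between, so it is not limited to the two words themselves.

```python
import re
from typing import Optional

# Same pattern as above, without the / delimiters that Python does not use.
EITHER_ORDER = re.compile(r"(?i)(?:test.*long|long.*test)")


def matched_span(sentence: str) -> Optional[str]:
    """Return the text spanned by the match, or None if there is no match."""
    m = EITHER_ORDER.search(sentence)
    return m.group(0) if m else None
```

For the sentence in the question, the result starts at "long" and ends at "test".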

Daniel C. Sobral
3

Vim has a branch operator \& that allows an even terser regex when searching for a line containing any number of words, in any order.

For example,

/.*test\&.*long

will match a line containing test and long, in any order.

See this answer for more information on usage. I am not aware of any other regex flavor that implements branching; the operator is not even documented in the Regular Expression Wikipedia entry.

Firstrock
2

I was using libpcre with C, where I could define callouts. They helped me to easily match not just words, but any subexpressions in any order. The regexp looks like:

(?C0)(expr1(?C1)|expr2(?C2)|...|exprn(?Cn)){n}

and the callout function ensures that every subexpression is matched exactly once, like:

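#include <string.h>
#include <pcre.h>

// Install by assigning to the global hook: pcre_callout = mycallout;
int mycallout(pcre_callout_block *b){
    // One counter per alternative, indexed by callout number minus one.
    static int subexpr[255];
    if(b->callout_number == 0){
        // Callout (?C0) at the start of the pattern: clear all counts to 0.
        memset(subexpr, 0, sizeof(subexpr));
        return 0;
    }else{
        // A return value > 0 makes the match fail at this point, so a
        // second use of the same alternative is rejected.
        return subexpr[b->callout_number-1]++;
    }
}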

Something like that should be possible in Perl as well.
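Python's `re` module has no callouts, so the sketch below swaps in a different technique to get the same "each subexpression exactly once, in any order" behaviour: it expands all orderings of the subexpressions into one alternation. The function name `any_order_pattern` is hypothetical, and the number of orderings grows factorially, so this is only reasonable for a handful of subexpressions.

```python
import itertools
import re


def any_order_pattern(subexprs):
    """Build a regex that matches all subexpressions consecutively,
    each exactly once, in any order."""
    orderings = (
        # Wrap each subexpression in a non-capturing group so that
        # operators such as | inside it cannot leak into its neighbours.
        "".join(f"(?:{s})" for s in perm)
        for perm in itertools.permutations(subexprs)
    )
    return re.compile("|".join(f"(?:{o})" for o in orderings))
```

For example, `any_order_pattern(["a+", "b", "c"])` accepts "baac" and "cab" with `fullmatch`, but rejects "abb" because the third subexpression never matches.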

Juraj
-9

I don't think that you can do it with a single regex. You'll need to do a logical AND of two, one searching for each word.

phatmanace