7

What do I use to search for multiple words in a string? I would like the logical operation to be AND, so that all the words are in the string somewhere. I have a bunch of nonsense paragraphs and one plain English paragraph, and I'd like to narrow it down by specifying a couple of common words such as "the" and "and", but I would like it to match all the words I specify.

Lance Roberts
Wes P

5 Answers

11

Regular expressions support a "lookaround" condition (here, a positive lookahead) that lets you search for a term within a string and then forget the location of the result, starting again at the beginning of the string for the next search term. This allows searching a string for a group of words in any order.

The regular expression for this is:

^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b)

Where \b is a word boundary and (?=...) is the lookahead construct: it checks that its contents match from the current position without consuming any characters.

If you have a variable number of words you want to search for, you will need to build this regular expression string with a loop - just wrap each word in the lookaround syntax and append it to the expression.
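The loop described above might look like the following minimal sketch. It uses Python only for illustration, since the question does not name a language; the function name build_all_words_pattern is hypothetical, and the DOTALL and IGNORECASE flags are assumptions made so that multi-line paragraphs and capitalised words still match.

```python
import re

def build_all_words_pattern(words):
    """Return a compiled regex that matches only if every word occurs."""
    # Wrap each word in its own lookahead; re.escape guards against
    # words that contain regex metacharacters.
    lookaheads = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    # Assumption: DOTALL lets '.' cross newlines, IGNORECASE matches "The".
    return re.compile("^" + lookaheads, re.DOTALL | re.IGNORECASE)

pattern = build_all_words_pattern(["the", "and"])
print(bool(pattern.search("The cat and the dog")))    # True
print(bool(pattern.search("Lorem ipsum dolor and")))  # False
```

Because every lookahead is anchored at the start of the string, the order in which the words appear in the text does not matter.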

Alan Moore
  • Exactly what I needed. Note there are a couple of asterisks missing above. Each section should be `(?=.*\bword\b)` – Tamlyn Jul 18 '11 at 15:27
  • The asterisks were there, but they were being treated as markup. I fixed it by applying code formatting. – Alan Moore Apr 12 '14 at 22:45
5

AND as concatenation

^(?=.*?\b(?:word1)\b)(?=.*?\b(?:word2)\b)(?=.*?\b(?:word3)\b)

OR as alternation

^(?=.*?\b(?:word1|word2|word3)\b)
^(?=.*?\b(?:word1)\b)|^(?=.*?\b(?:word2)\b)|^(?=.*?\b(?:word3)\b)
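To see the difference between the two forms, here is a small sketch in Python (chosen only for illustration) with "the" and "and" standing in for word1 and word2. The sample strings and the names AND_PATTERN and OR_PATTERN are made up for this example.

```python
import re

# AND: one lookahead per word, concatenated.
AND_PATTERN = re.compile(r"^(?=.*?\b(?:the)\b)(?=.*?\b(?:and)\b)")
# OR: a single lookahead containing an alternation.
OR_PATTERN = re.compile(r"^(?=.*?\b(?:the|and)\b)")

samples = ["the cat and the dog", "only the cat", "lorem ipsum"]
for s in samples:
    # Prints the sample followed by the AND result and the OR result.
    print(s, bool(AND_PATTERN.search(s)), bool(OR_PATTERN.search(s)))
```

Only the first sample satisfies the AND pattern; the first two satisfy the OR pattern.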
Markus Jarderot
2

Firstly I'm not certain what you're trying to return... the whole sentence? The words in between your two given words?

Something like:

\b(word1|word2)\b(\w+\b)*(word1|word2)\b(\w+\b)*\.

(where \b is the word boundary in your language) would match a complete sentence that contained either of the two words, or both.

You'd probably need to make it case-insensitive, so that a word at the start of a sentence still matches.

brasskazoo
  • Doesn't that just match a sentence that contains two words, either word1 followed by word2, or word2 followed by word1 (as desired), or word1 followed by word1, or word2 followed by word2 (as not desired)? That was the sort of problem I ran into when trying to answer. – Jonathan Leffler Oct 17 '08 at 03:20
2

Maybe using a language recognition chart to recognize English would work. Some quick tests seem to work (this assumes paragraphs are separated by newlines only).

The regexp matches if any one of these alternatives is found: \bword\b is a whole word delimited by boundaries, word\b is a word ending, and a bare word matches anywhere in the paragraph.

my @paragraphs = split(/\n/,$text);
for my $p (@paragraphs) {
    if ($p =~ m/\bthe\b|\band\b|\ban\b|\bin\b|\bon\b|\bthat\b|\bis\b|\bare\b|th|sh|ough|augh|ing\b|tion\b|ed\b|age\b|’s\b|’ve\b|n’t\b|’d\b/) {
       print "Probable english\n$p\n";
    }
}
user115014
Vinko Vrsalovic
0

Assuming PCRE (Perl regexes), I am not sure that you can do it at all easily. The AND operation is concatenation of regexes, but you want to be able to permute the order in which the words appear without having to formally generate the permutation. For N words, when N = 2, it is bearable; with N = 3, it is barely OK; with N > 3, it is unlikely to be acceptable. So, the simple iterative solution - N regexes, one for each word, and iterate ensuring each is satisfied - looks like the best choice to me.
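The iterative approach could be sketched as follows. Python is used only for illustration; the helper name contains_all_words and the sample paragraphs are hypothetical, and case-insensitive matching is an assumption.

```python
import re

def contains_all_words(text, words):
    """True if every word occurs in text as a whole word."""
    # One small regex per word; all() stops at the first missing word.
    return all(
        re.search(rf"\b{re.escape(w)}\b", text, re.IGNORECASE)
        for w in words
    )

paragraphs = [
    "Lorem ipsum dolor sit amet",
    "The quick fox and the lazy dog",
]
english = [p for p in paragraphs if contains_all_words(p, ["the", "and"])]
print(english)  # ['The quick fox and the lazy dog']
```

The cost grows linearly with the number of words, and no permutations need to be generated.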

Jonathan Leffler
  • Why do the N things have to be regexes though? Could just use "index" here. –  Oct 16 '08 at 22:58
  • 1
    \b(foo|bar|baz)\b.*\b(?!\1)(foo|bar|baz)\b.*\b(?!\1)(?!\2)(foo|bar|baz)\b ought to handle permutations by using back references and negative lookahead to avoid matching a word twice. It's still properly evil, but at least the pattern length isn't O(N!) – stevemegson Oct 16 '08 at 23:19
  • @BKB: I'm not sure what you mean by using an index. – Jonathan Leffler Oct 17 '08 at 03:23
  • @SteveMegson: Yes, I think I see what you're up to - and not being sure of the scope of negative lookahead (a relatively new feature of Perl - since I was really learning it, back in the days of 4.x, and 5.[0-6]), I was not dogmatic in my answer. As you say, not nice, but not combinatorial either. – Jonathan Leffler Oct 17 '08 at 03:25
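As a quick check of the back-reference pattern from the comments, the sketch below runs it through Python's re module, which supports back references inside negative lookaheads. The function name has_all_three and the sample strings are invented for this example; the pattern itself is copied from the comment unchanged.

```python
import re

# Each later group must differ from the earlier captures, so three
# matches can only mean three distinct words, in any order.
PERMUTATION_PATTERN = re.compile(
    r"\b(foo|bar|baz)\b.*"
    r"\b(?!\1)(foo|bar|baz)\b.*"
    r"\b(?!\1)(?!\2)(foo|bar|baz)\b"
)

def has_all_three(text):
    return PERMUTATION_PATTERN.search(text) is not None

print(has_all_three("baz then foo then bar"))  # True
print(has_all_three("foo and foo and bar"))    # False
```

The pattern grows linearly with the number of words, but each added word needs one more negative lookahead per earlier group.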