3

I have a list of search terms and I would like to have a regex that matches all items that have at least two of them.

Terms: war|army|fighting|rebels|clashes

Match: The war between the rebels and the army resulted in several clashes this week. (4 hits)

Non-Match: In the war on terror, the Obama administration wants to increase the number of drone strikes. (only 1 hit)

Background: I use Tiny Tiny RSS to collect and filter a large number of feeds for a news reporting project. I get 1000 - 2000 feed items per day and would like to filter them by keywords. Using just the | (OR) expression, I get too many false positives, so I figured I could just ask for two matches in a feed item.

Thanks!

EDIT:

I know very little about regex, so I have stuck with the simple | (OR) operator so far. I tried putting the search terms in parentheses, (war|fighting|etc){2,}, but that only matches if an item uses the same word twice.

EDIT2: Sorry for the confusion, I'm new to regex and the like. The fact is that the regex queries a MySQL database. It is entered in the tt-rss backend as a filter, which allows only one line (although theoretically an unlimited number of characters). The filter is applied when the feed item is imported into the MySQL database.

user1428228
  • possible duplicate of [Regex to match string containing two names in any order](http://stackoverflow.com/questions/4389644/regex-to-match-string-containing-two-names-in-any-order). Depending on the language you're using it might be (a lot) easier to just loop on the words and check if they exist in the string - bailing when you find 2 matches. – AD7six May 31 '12 at 11:17
  • What language are you doing this in? What have you tried? – ghoti May 31 '12 at 11:20
  • People are answering because it's an interesting question, but the *quality* of the question needs improvement. Please tag your question with a language, and show any steps you've already tried. – Todd A. Jacobs May 31 '12 at 11:38

4 Answers

9
(.*?\b(war|army|fighting|rebels|clashes)\b){2,}
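
A quick way to see what this pattern accepts is to run it against the sentences from the question. The following is only a sketch in Java; the class name `AtLeastTwoHits` and the method `hasTwoHits` are made up for the illustration. Note that it also accepts text that repeats a single term, since the pattern counts occurrences rather than distinct terms.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original answer.
final class AtLeastTwoHits {
    // The pattern from above: a lazily skipped prefix plus one whole-word term, repeated 2+ times.
    private static final Pattern TWO_HITS = Pattern.compile(
        "(.*?\\b(war|army|fighting|rebels|clashes)\\b){2,}");

    // True if the text contains at least two term occurrences (not necessarily distinct).
    static boolean hasTwoHits(String text) {
        return TWO_HITS.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "The war between the rebels and the army resulted in several clashes this week.",
            "In the war on terror, the obama administration wants to increase the number of drone strikes.",
            "war war war",
            "warm farmy people"
        };
        for (String s : samples) {
            System.out.println(hasTwoHits(s) + " : " + s);
        }
    }
}
```

With these samples, the first and third lines should report true and the other two false.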

If you need to avoid matching the same term, you can use:

.*?\b(war|army|fighting|rebels|clashes)\b.*?(\b(?!\1)(war|army|fighting|rebels|clashes)\b)

which matches a term, then uses a negative lookahead on the backreference \1 to avoid matching the same term again.

In Java:

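import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiwordDemo {
    public static void main(String[] args) {
        // Group 1 captures the first term; (?!\1) rejects the same term the second time.
        Pattern multiword = Pattern.compile(
            ".*?(\\b(war|army|fighting|rebels|clashes)\\b)" +
            ".*?(\\b(?!\\1)(war|army|fighting|rebels|clashes)\\b)"
        );
        for (String str : Arrays.asList(
                "war",
                "war war war",
                "warm farmy people",
                "In the war on terror rebels eating faces")) {
            Matcher m = multiword.matcher(str);
            if (m.find()) {
                System.out.println(str + " : " + m.group(0));
            } else {
                System.out.println(str + " : no match.");
            }
        }
    }
}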

Prints:

war : no match.
war war war : no match.
warm farmy people : no match.
In the war on terror rebels eating faces : In the war on terror rebels
beerbajay
  • Hmm, true, the question is a bit unclear on whether that is a requirement or not. Might be possible to avoid that by using backreferences. – beerbajay May 31 '12 at 11:26
  • I can't get that regex to work - but if it does work that's excellent. No word boundary in the regex though, so it'll match text containing e.g. "warm farmy people" – AD7six May 31 '12 at 11:41
  • Query SELECT DISTINCT date_entered, guid, ttrss_entries.id,ttrss_entries.title, updated, label_cache, tag_cache, always_display_enclosures, site_url, note, num_comments, comments, int_id, unread,feed_id,marked,published,link,last_read,orig_feed_id, SUBSTRING(last_read,1,19) as last_read_noms, ttrss_feeds.title AS feed_title, content as content_preview, SUBSTRING(updated,1,19) as updated_noms, author,score FROM ttrss_entries,ttrss_user_entries,ttrss_feeds WHERE ttrss_user_entries.feed_id = ttrss_feeds.id AND ttrss_user_entries.ref_id = ttrss_entries.id AND – user1428228 May 31 '12 at 12:11
  • ttrss_user_entries.owner_uid = '1' AND (LOWER(ttrss_entries.title) REGEXP LOWER('(.*?\\b(.krieg.|konflikt.|.k.mpf.|.töt.|frieden.|feuerpause|waffen.|panzer|.gewehr.|.miliz.|armee|rebell.|aufstand|terror.)\\b){2,}') OR LOWER(ttrss_entries.content) REGEXP LOWER('(.*?\\b(.krieg.|konflikt.|.k.mpf.|.töt.|frieden.|feuerpause|waffen.|panzer|.gewehr.|.miliz.|armee|rebell.|aufstand|terror.)\\b){2,}')) AND ttrss_entries.date_entered > DATE_SUB(NOW(), INTERVAL 14 DAY) AND cat_id = '2' ORDER BY date_entered DESC LIMIT 30 OFFSET 0 failed: Got error 'repetition-operator operand invalid' from regexp – user1428228 May 31 '12 at 12:13
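
The error in the comment above is probably caused by the lazy quantifier `.*?`. As far as I can tell, MySQL releases before 8.0.4 use a POSIX-style engine for REGEXP that has no lazy quantifiers, lookaheads, or backreferences, and that writes word boundaries as `[[:<:]]` and `[[:>:]]` rather than `\b`. Assuming such a server, a greedy variant of the first pattern may work as a one-line filter. The sketch below builds that string in Java; `GreedyFilterBuilder` and `buildFilter` are hypothetical names, and the Java `\b` variant is there only so the logic can be checked locally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: builds "(<open>(t1|t2|...)<close>.*){2,}" using greedy matching only.
final class GreedyFilterBuilder {
    // Terms are assumed to be plain words without regex metacharacters.
    static String buildFilter(List<String> terms, String open, String close) {
        return "(" + open + "(" + String.join("|", terms) + ")" + close + ".*){2,}";
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("war", "army", "fighting", "rebels", "clashes");

        // Assumed form for a pre-8.0.4 MySQL server (POSIX word-boundary markers).
        System.out.println(buildFilter(terms, "[[:<:]]", "[[:>:]]"));

        // Same structure with Java's \b, to check the logic locally.
        Pattern p = Pattern.compile(buildFilter(terms, "\\b", "\\b"));
        System.out.println(p.matcher("The war between the rebels and the army").find());
        System.out.println(p.matcher("In the war on terror").find());
    }
}
```

Like the original `{2,}` pattern, this counts occurrences, so the same term appearing twice still matches. On MySQL 8.0.4 or later, where REGEXP is backed by ICU, the `\b` and lazy forms should be accepted instead.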
1

This isn't (entirely) a job for regular expressions. A better approach is to scan the text, and then count the unique match groups.

In Ruby, it would be very simple to branch based on your match count. For example:

# Word boundaries keep "warm" from counting as "war".
terms = /\b(?:war|army|fighting|rebels|clashes)\b/
text = "The war between the rebels and the army resulted in..."

# The real magic happens here.
match = text.scan(terms).uniq

# Do something if your minimum match count is met.
if match.count >= 2
  p match
end

This will print ["war", "rebels", "army"].
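
For comparison with the Java code in the other answer, the same scan-and-count idea might look roughly like this in Java. This is a sketch under assumptions: the class and method names are invented, and word boundaries are added so that "warm" does not count as "war".

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class UniqueTermCounter {
    private static final Pattern TERMS =
        Pattern.compile("\\b(?:war|army|fighting|rebels|clashes)\\b");

    // Collects each distinct term found in the text, in order of first appearance.
    static Set<String> uniqueTerms(String text) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = TERMS.matcher(text);
        while (m.find()) {
            seen.add(m.group());
        }
        return seen;
    }

    public static void main(String[] args) {
        Set<String> hits = uniqueTerms("The war between the rebels and the army resulted in...");
        // Do something if the minimum match count is met.
        if (hits.size() >= 2) {
            System.out.println(hits); // prints [war, rebels, army]
        }
    }
}
```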

Todd A. Jacobs
0

Regular expressions could do the trick, but the expression would be quite large.

Remember, they are simple tools (based on finite-state automata) and hence have no memory that would let them remember which words were already seen. So such a regex, even though possible, would probably just look like a huge lump of ORs (as in, one alternative for every possible order of inputs or something).
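
To make the "lump of ORs" concrete: for n terms there are n(n-1) ordered pairs of distinct terms, so the five terms from the question give 20 alternatives. The following Java sketch generates such an expression; the class name `PairwiseAlternation` and the method `build` are hypothetical, and it assumes the terms are plain words with no regex metacharacters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical generator for the brute-force alternation; illustration only.
final class PairwiseAlternation {
    // Builds "\bA\b.*\bB\b" for every ordered pair of distinct terms, joined with "|".
    static String build(List<String> terms) {
        List<String> alternatives = new ArrayList<>();
        for (String first : terms) {
            for (String second : terms) {
                if (!first.equals(second)) {
                    alternatives.add("\\b" + first + "\\b.*\\b" + second + "\\b");
                }
            }
        }
        return String.join("|", alternatives);
    }

    public static void main(String[] args) {
        String regex = build(Arrays.asList("war", "army", "fighting", "rebels", "clashes"));
        System.out.println(regex.split("\\|").length); // 5 * 4 = 20 alternatives
        Pattern p = Pattern.compile(regex);
        System.out.println(p.matcher("The war between the rebels and the army").find());
        System.out.println(p.matcher("war war war").find());
    }
}
```

The result uses only alternation and `.*`, so it needs no lookahead or backreference, at the cost of growing quadratically with the number of terms.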

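I recommend doing the parsing yourself, for instance (in Java):

import java.util.HashSet;
import java.util.Set;

// Returns true as soon as two distinct search terms have been seen in the input.
static boolean hasTwoTerms(String input, Set<String> yourWords) {
    Set<String> searchTerms = new HashSet<>(yourWords); // copy, because terms get removed
    int found = 0;
    for (String x : input.split("\\W+")) {  // split the input into words
        if (searchTerms.remove(x)) {        // true only the first time a term is seen
            ++found;
        }
        if (found >= 2) return true;
    }
    return false;
}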
Kos
0

If you want to do it all with a regex it's not likely to be easy.

You can however do something like this:

<?php
...
$string = "The war between the rebels and the army resulted in several clashes this week. (4 hits)";


preg_match_all('@\b(war|army|fighting|rebels|clashes)\b@', $string, $matches);
$uniqueMatchingWords = array_unique($matches[0]);
if (count($uniqueMatchingWords) >= 2) {
    //bingo
}
AD7six