4

I am trying to write a QRegExp expression that includes one group of words but excludes another, and I am having trouble figuring it out.

Here are some of the things I tried (this example included all of the words):

(words|I|want|to|include)(?!the|ones|that|should|not|match)

So I tried this (which returned nothing):

^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$

Am I missing something?

Edit: The reason why I need such an unusual regex (include/exclude) is because I want to search through a series of articles and filter the ones that have the included words in them but not if they also have the excluded words in them.

So for example if article A is:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and article B is:

Vivamus fermentum semper porta.

Then a regex that includes lorem would filter article A but not B. But if ipsum is a word that I'm excluding, I do not want article A to be filtered.

I considered doing a regex to filter out the articles with the words that I want and then run a second regex excluding articles from the first set that I do not want, but unfortunately the software I am using does not allow me to do this. I can only run one regular expression.

thequerist
  • That doesn't make sense. You're explicitly enumerating the words you want to match ("include"). There's no need to "exclude" anything afterwards; you already know what's on your whitelist. – melpomene Aug 15 '15 at 15:34
  • What do you mean by include? at least one word from the list? – Avinash Raj Aug 15 '15 at 15:41
  • I use this RSS software (QuiteRSS) that allows me to filter out articles using qregexp that contain certain words. However, I do not want the articles that contain these words to be filtered if the words in the negative lookahead are also in the article. – thequerist Aug 15 '15 at 15:50
  • 3
    Possible duplicate of https://xkcd.com/1313/ or http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb – Kristján Aug 15 '15 at 16:09
  • All the answers are the same. –  Aug 18 '15 at 15:56

5 Answers

4

I don't think you need a tempered greedy token here. Put the excluded words as alternatives inside a negative lookahead anchored at the start of the string. Let me walk you through it.

You have the string Lorem ipsum dolor sit amet, consectetur adipiscing elit. and you want it to match because it contains the word lorem. The regex for that is \\blorem\\b (with case-insensitive matching enabled, e.g. via QRegExp::setCaseSensitivity(Qt::CaseInsensitive)), where \b forces whole-word matching. To prevent a match when the string also contains the word ipsum, put a negative lookahead at the very beginning of the string:

^(?!.*\\bipsum\\b).*\\blorem\\b

Now, it does not match the string in question.
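As a quick check of the behaviour described above, here is a minimal sketch that uses Python's `re` module as a stand-in for QRegExp (an assumption; the two engines are not identical). The backslashes are single because the pattern is a raw string, and `article_c` is a hypothetical third article added for illustration.

```python
import re

# Python's re is only a stand-in for QRegExp here.
pattern = re.compile(r"^(?!.*\bipsum\b).*\blorem\b", re.IGNORECASE)

article_a = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
article_b = "Vivamus fermentum semper porta."
article_c = "Lorem dolor sit amet."  # hypothetical: included word, no excluded word

for text in (article_a, article_b, article_c):
    # Prints False for A (contains ipsum), False for B (no lorem), True for C.
    print(bool(pattern.search(text)), text)
```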

To add more alternatives, use the alternation operator |, like this: ^(?!.*\\b(?:words|to|exclude)\\b).*\\b(?:words|to|include)\\b. Note the use of non-capturing groups: they do not store the captured text, which can improve performance compared to capturing groups that save the matched text in a buffer.

Thus, you get

^(?!.*\\b(?:the|ones|that|should|not|match)\\b).*\\b(?:words|I|want|to|include)\\b

See demo

Two remarks:

  1. On the demo website, single backslashes must be used; I am doubling them here for QRegExp.
  2. In Qt, . in the pattern matches any character, including a newline. On the demo website, the dot does not match newline characters. You can replace it with [^\n] if you need the same behaviour, but I don't think that is necessary.
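The pattern above can also be assembled from two word lists. The following is a sketch in Python rather than Qt, so `build_filter` and `article_filter` are hypothetical names, and `re.DOTALL` is used only to approximate the Qt dot behaviour mentioned in the second remark.

```python
import re

def build_filter(include, exclude):
    """Compile a pattern: at least one include word and no exclude word."""
    inc = "|".join(re.escape(w) for w in include)
    exc = "|".join(re.escape(w) for w in exclude)
    # re.DOTALL lets "." cross newlines, approximating the Qt behaviour.
    return re.compile(
        rf"^(?!.*\b(?:{exc})\b).*\b(?:{inc})\b",
        re.IGNORECASE | re.DOTALL,
    )

article_filter = build_filter(
    include=["words", "I", "want", "to", "include"],
    exclude=["the", "ones", "that", "should", "not", "match"],
)

# Only an included word is present.
print(bool(article_filter.search("I like regular expressions")))
# An excluded word ("the") is present as well.
print(bool(article_filter.search("I like the regular expressions")))
# The excluded word ("not") sits on the second line.
print(bool(article_filter.search("I like regex\nbut not this one")))
```

Without `re.DOTALL`, the lookahead in the last call would stop at the newline and miss the excluded word on the second line.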
Wiktor Stribiżew
3
^(?:(?!\b(?:the|ones|that|should|not|match)\b).)*\b(?:words|I|want|to|include)\b(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$

The exclusion lookahead has to guard both the text before and the text after the word that should match. See demo.

https://regex101.com/r/bK9wF1/3

or

^(?!.*\b(?:the|ones|that|should|not|match)\b)(?=.*\b(?:words|I|want|to|include)\b).*$

Alternatively, put both conditions in lookaheads. See demo.

https://regex101.com/r/uF4oY4/60
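To compare the two variants side by side, here is a small sketch in Python (an assumption; QRegExp may differ in details). The sample strings are made up for illustration, and both patterns should agree on each of them.

```python
import re

EXCLUDE = r"\b(?:the|ones|that|should|not|match)\b"
INCLUDE = r"\b(?:words|I|want|to|include)\b"

# Variant 1: tempered dots on both sides of the included word.
tempered = re.compile(rf"^(?:(?!{EXCLUDE}).)*{INCLUDE}(?:(?!{EXCLUDE}).)*$")
# Variant 2: both conditions expressed as lookaheads at the start.
lookaheads = re.compile(rf"^(?!.*{EXCLUDE})(?=.*{INCLUDE}).*$")

samples = [
    "some words here",     # included word only
    "the words here",      # excluded word before the included word
    "words and the rest",  # excluded word after the included word
    "nothing relevant",    # no included word ("nothing" is not the word "not")
]

results = {}
for text in samples:
    results[text] = (bool(tempered.search(text)), bool(lookaheads.search(text)))
    print(results[text], text)
```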

vks
2

You were so close. The reason

^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$

doesn't work is that it means: start with one of the words I want to include, then continue until the end with characters that do not begin one of the words I want to exclude. To fix it, change the starting check to a positive lookahead:

^(?=.*(?:words|I|want|to|include))(?:(?!the|ones|that|should|not|match).)*$

Now this means: ensure that somewhere in the string there is at least one of the words I want to include, and then continue as in the original regex.

To make it even more strict, you could use word boundaries:

^(?=.*\b(?:words|I|want|to|include)\b)(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$

Note that these patterns are all case sensitive. To change that, you can use QRegExp::setCaseSensitivity.
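The difference between the original attempt and the lookahead version can be seen in a short Python sketch. Python's `re` stands in for QRegExp here (an assumption), `re.IGNORECASE` plays the role of the case-sensitivity setting, and the sample sentences are hypothetical.

```python
import re

original = re.compile(
    r"^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$"
)
fixed = re.compile(
    r"^(?=.*\b(?:words|I|want|to|include)\b)"
    r"(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$"
)
# Same pattern as fixed, but case insensitive.
fixed_ci = re.compile(fixed.pattern, re.IGNORECASE)

text = "Please include this article"
# The original requires the string to start with an included word, so it fails.
print(bool(original.search(text)))
# The lookahead finds "include" anywhere in the string, so this succeeds.
print(bool(fixed.search(text)))
# An excluded word anywhere blocks the match.
print(bool(fixed.search("Please include the article")))
# Upper-case text only matches the case-insensitive version.
print(bool(fixed.search("Please INCLUDE this article")),
      bool(fixed_ci.search("Please INCLUDE this article")))
```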

ndnenkov
1

Try this:

^(?:(?:(?!\b(?:the|ones|that|should|not|match)\b).|))*?\b(?:words|I|want|to|include)\b(?:(?:(?!\b(?:the|ones|that|should|not|match)\b).|))*$


See Debuggex Demo (with matching and non-matching examples).

Note: The above assumes QRegExp supports variable-length lookahead - I haven't verified this.

Explanation:

  1. All words must be exact (e.g. include "word" but not "sword" or "words") so are wrapped in \b either side.
  2. For the words you want to include, it only matters that at least one of them appears at least once, so that is all the pattern searches for.
  3. None of the words in the exclude list may appear before or after the searched-for word, hence the need for an "exclusion group" on either side of it.
  4. Exclusion groups are implemented using a method that is explained very well in this answer.
  5. The first exclusion group uses *? to make it non-greedy so it doesn't consume the whole text and stops as soon as the searched for word is found.
  6. The regular expression is wrapped in ^...$ to ensure the whole string is checked/matched, not just part of it.
  7. All groups are marked as non-capturing groups by using ?: immediately after the first parenthesis.
  8. The matching should presumably be case insensitive so the regular expression should have the appropriate flag to do this (e.g. /i).
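The effect of the word boundaries and the case-insensitive flag from the list above can be illustrated with a minimal Python sketch (Python's `re` is an assumption here, not QRegExp; `exact` and `loose` are hypothetical names).

```python
import re

exact = re.compile(r"\bword\b", re.IGNORECASE)  # whole word only
loose = re.compile(r"word", re.IGNORECASE)      # any substring

for text in ("a word here", "a Sword here", "many words here"):
    # exact matches only the first text; loose matches all three.
    print(bool(exact.search(text)), bool(loose.search(text)), text)
```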
Steve Chambers
0

A simplified version of what you seem to need:

^(?:(?!ipsum).)*(?:lorem)(?:(?!ipsum).)*$

Formatted:

^                    # BOS
 (?:
      (?! ipsum )          # Preceding text, but not these words
      . 
 )*
 (?: lorem )          # Text wanted
 (?:
      (?! ipsum )          # Following text, but not these words
      . 
 )*
 $                    # EOS
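A formatted pattern like the one above can be run directly in engines that offer a free-spacing mode. As a sketch, this is how it looks with Python's `re.VERBOSE` (an assumption; whether the target software accepts such formatting is not something this example shows). `re.IGNORECASE` is added so that "Lorem" matches.

```python
import re

pattern = re.compile(
    r"""
    ^                    # beginning of string
    (?:
        (?! ipsum )      # preceding text, but not these words
        .
    )*
    (?: lorem )          # text wanted
    (?:
        (?! ipsum )      # following text, but not these words
        .
    )*
    $                    # end of string
    """,
    re.VERBOSE | re.IGNORECASE,
)

print(bool(pattern.search("Lorem dolor sit amet")))  # lorem without ipsum
print(bool(pattern.search("Lorem ipsum dolor")))     # ipsum after lorem
print(bool(pattern.search("ipsum lorem")))           # ipsum before lorem
```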