I have a variant of exact match that I'm struggling to execute using regex. I would like to match several words (e.g. Apple, Bat, Car) to a string while ignoring order and also being exclusive (i.e. ignoring cases with extra words, or too few words). For example (using the list above), I'd like the following outcomes (true/false):

  • Bat, Car, Apple (True)
  • Car, Bat, Apple (True)
  • Apple, Car, Bat (True)
  • Apple, Car, Bat, Stick (False)
  • Bat, Car (False)
  • Apple (False)

I have tried two things:

(1) lookahead assertions

^(?=.*Apple)(?=.*Bat)(?=.*Car).*
  • Bat, Car, Apple (True)
  • Car, Bat, Apple (True)
  • Apple, Car, Bat (True)
  • Apple, Car, Bat, Stick (True)
  • Bat, Car (False)
  • Apple (False)

This almost works, but it allows strings with additional words (e.g. the case with "Stick"). What can I add to exclude these cases, assuming "Stick" can be any other word and there could be multiple additional words?

(2) Following related Q/A examples on Stack Overflow

^(Apple|Bat|Car|[,\s])+$
  • Bat, Car, Apple (True)
  • Car, Bat, Apple (True)
  • Apple, Car, Bat (True)
  • Apple, Car, Bat, Stick (False)
  • Bat, Car (True)
  • Apple (True)

This again almost works, but it incorrectly accepts the smaller subsets.

Edit: Note, my list of words to match is just an example; it will be variable and can contain any number of words.

oldskooo
  • Do you really need a regular expression? You could split the string into words. If you have 3 items, then you can check whether "Apple", "Bat" and "Car" are elements of the split. Otherwise it is `False`. – mosc9575 Nov 19 '22 at 20:23
  • @mosc9575, I expect the OP is aware that this task could be performed easily in code without the use of a regular expression. Generally, when the "regex" tag is present the asker either needs a regular expression (as input to other code, for example) or just wants to know if a regular expression could be used, for whatever reason, curiosity being one. – Cary Swoveland Nov 20 '22 at 06:54
  • You are correct @CarySwoveland. – oldskooo Nov 20 '22 at 12:31
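
For reference, the split-based check suggested in the first comment could look roughly like the sketch below. It is not a regex answer; the function name `has_exactly` is made up for illustration, and it assumes the items are comma-separated and that the required words are unique.

```python
def has_exactly(text, required):
    """Return True if the comma-separated items in text are exactly the
    required words, in any order (hypothetical helper, not from the thread)."""
    items = [item.strip() for item in text.split(",")]
    # Equal length rules out extra, missing and repeated items;
    # set equality then ignores the order.
    return len(items) == len(required) and set(items) == set(required)


words = ["Apple", "Bat", "Car"]
for line in ["Bat, Car, Apple", "Apple, Car, Bat, Stick", "Bat, Car", "Apple"]:
    print(line, "->", has_exactly(line, words))
```

Because the word list is an ordinary argument, this version handles any number of words without rewriting anything.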

3 Answers

First, this is quite a stretched use of regex; you may be better off using other string functions (depending on the language).

Regex: ^(apple|bat|car), (?!\1)(apple|bat|car), (?!\1|\2)(apple|bat|car)$

demo: https://regex101.com/r/Yc8CVj/2

Very rough human translation: at the start of the line, capture any one of the words; check that the next word is different and capture it if it is one of the other two; then check that the last word is the one remaining and that the line ends right after it.

Features

  • prevents duplicates (apple, apple, car)
  • (according to demo) around 30 steps for match
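
A quick way to try this pattern against the examples from the question is the Python sketch below. The pattern is copied from above; the `re.IGNORECASE` flag is an assumption on my part, since the pattern is written in lowercase while the sample strings are capitalised, and the name `is_exact_permutation` is only illustrative.

```python
import re

# Pattern copied from the answer. IGNORECASE is an assumption so that the
# lowercase alternatives also match "Apple", "Bat" and "Car".
BACKREF_PATTERN = re.compile(
    r"^(apple|bat|car), (?!\1)(apple|bat|car), (?!\1|\2)(apple|bat|car)$",
    re.IGNORECASE,
)


def is_exact_permutation(text):
    """True if text is the three words, each once, separated by ', '."""
    return BACKREF_PATTERN.search(text) is not None


for line in ["Bat, Car, Apple", "Apple, Car, Bat, Stick", "Bat, Car", "Apple, Apple, Car"]:
    print(line, "->", is_exact_permutation(line))
```

Note that the separator is hard-coded as a comma followed by exactly one space, so a string such as "Bat,Car,Apple" would not match.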
akash
  • This is an example where a *subroutine* or *subexpression* can be used to advantage, provided the regex engine supports them, as do PCRE, Python's, Ruby's, R's and others: `^((?P<all>apple|bat|car)), (?!\1)((?P>all)), (?!\1|\2)((?P>all))$`. [Demo](https://regex101.com/r/Cu5oZn/1). Here `(?P>all)` simply tells the regex engine to invoke, at that location, the code used to obtain a match for capture group `all`. As well as shortening the expression, subroutines reduce the chance of errors by avoiding cutting and pasting duplicate code. – Cary Swoveland Nov 20 '22 at 07:33
  • Thank you for sharing this! Would I be correct in understanding that it's saving an expression as a variable for reuse, then? Amazing - I've never used these but shall from now on @CarySwoveland – akash Nov 20 '22 at 11:34
  • @CarySwoveland - these solutions are specific to 3-word combinations, correct? I'm leaning towards one of the other solutions (e.g. looking at the length or number of words, or using a singular expression), given that this wouldn't expand very well to larger word combinations. – oldskooo Nov 20 '22 at 12:39
  • All I can say is that if you change `apple|bat|car` in `(?P<all>apple|bat|car)` to anything that makes sense to the regex engine, `(?P>all)` will invoke that code at its location in the expression. See the fuller explanation [here](https://www.regular-expressions.info/subroutine.html). – Cary Swoveland Nov 21 '22 at 00:23
  • akash, yes, that's the functionality. As you see from the link in my comment immediately above it works with numbered captured groups as well. I suggested a named group here to avoid confusion with the numbered groups. – Cary Swoveland Nov 21 '22 at 00:31
  • @CarySwoveland I don't think my point was understood. If you add a word to the list (e.g. going from 3 to 4), the solution does not behave as expected because you need to add the additional expressions for checking more words. This is not ideal in my case. See example: https://regex101.com/r/48veSl/1 – oldskooo Nov 21 '22 at 12:32
  • As I understand your point we would need [this](https://regex101.com/r/1OfpyV/1) for your example with four words. Regretfully, I don't think you'll find any regex that allows you to change the collection of words of interest in just one place. – Cary Swoveland Nov 21 '22 at 20:28
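
On the point about growing the word list: if the regex is fed to other code anyway, one workaround is to generate the longer pattern from the list instead of editing it by hand. The sketch below does this in Python for the backreference pattern from this answer; `build_backref_pattern` is a hypothetical helper name, and it assumes the same ", " separator and unique words.

```python
import re


def build_backref_pattern(words):
    """Build ^(w1|w2|..), (?!\\1)(w1|w2|..), (?!\\1|\\2)(w1|w2|..)...$ for the
    given words (hypothetical helper; one capture group per position)."""
    alternatives = "|".join(re.escape(word) for word in words)
    parts = [f"({alternatives})"]
    for position in range(1, len(words)):
        # Negative lookahead listing every group captured so far, e.g. \1|\2.
        seen = "|".join(f"\\{group}" for group in range(1, position + 1))
        parts.append(f"(?!{seen})({alternatives})")
    return re.compile("^" + ", ".join(parts) + "$")


four_words = build_backref_pattern(["Apple", "Bat", "Car", "Dog"])
print(four_words.pattern)
print(bool(four_words.search("Dog, Car, Apple, Bat")))
print(bool(four_words.search("Dog, Car, Apple")))
```

The generated expression still grows with every added word; the list is just maintained in one place in the calling code rather than in the regex itself.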

An idea is to just check for exactly three words after the lookaheads:

^(?=.*?\bApple\b)(?=.*?\bBat\b)(?=.*?\bCar\b)\w+(?:, ?\w+){2}$

See this demo at regex101. I also added \b word boundaries around the words.
\w matches word characters, and , ? matches a comma followed by an optional space between words.
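
As a small sanity check, here is the pattern above run in Python against the strings from the question. The pattern is unchanged; only the wrapper name `matches_three_words` is invented for this sketch.

```python
import re

# Lookaheads require each word somewhere in the line; the tail then
# allows exactly three comma-separated words in total.
LOOKAHEAD_THEN_COUNT = re.compile(
    r"^(?=.*?\bApple\b)(?=.*?\bBat\b)(?=.*?\bCar\b)\w+(?:, ?\w+){2}$"
)


def matches_three_words(text):
    return LOOKAHEAD_THEN_COUNT.search(text) is not None


for line in ["Bat, Car, Apple", "Apple, Car, Bat, Stick", "Bat, Car", "Apple,Car,Bat"]:
    print(line, "->", matches_three_words(line))
```

For a different list, both the lookaheads and the {2} count (number of words minus one) would need to change together.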


Another variant by capturing and failing if the same word is ahead:

^(?:\b(?:, ?)?(Apple|Bat|Car)\b(?!.*?\b\1\b)){3}$

Regex101 demo - The optional separator depends on \b in this one.
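
Since the question mentions that the word list is variable, this second variant is convenient to generate: only the alternation and the repeat count change. Below is a sketch in Python; `build_no_repeat_pattern` is a made-up helper name, and it assumes every word starts and ends with a word character so that \b behaves as intended.

```python
import re


def build_no_repeat_pattern(words):
    """Build ^(?:\\b(?:, ?)?(w1|w2|...)\\b(?!.*?\\b\\1\\b)){n}$ for the given
    words (hypothetical helper; n is the number of words)."""
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(
        r"^(?:\b(?:, ?)?(" + alternatives + r")\b(?!.*?\b\1\b)){"
        + str(len(words))
        + r"}$"
    )


pattern = build_no_repeat_pattern(["Apple", "Bat", "Car", "Dog"])
for line in ["Dog, Apple, Car, Bat", "Apple, Bat, Car", "Apple, Apple, Car, Bat"]:
    print(line, "->", pattern.search(line) is not None)
```

Each repetition captures one listed word and fails if that same word appears again later in the line, so n repetitions can only succeed with n distinct words.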

bobble bubble

Try:

(?=.*Apple)(?=.*Car)(?=.*Bat)(?!.*(?:,|^)(?:(?!Apple|Bat|Car).)+(?:,|$))^.*$

Regex demo.


(?=.*Apple)(?=.*Car)(?=.*Bat) - we want to match a line where Apple, Car and Bat are all found

(?!.*(?:,|^)(?:(?!Apple|Bat|Car).)+(?:,|$)) - we don't want to match a line where any other word is found. A word is whatever sits between two commas, or between a comma and the start/end of the line

^.*$ - we want to match the whole line


EDIT: Regex with word boundaries \b (to not match Cartography for example):

(?=.*\bApple\b)(?=.*\bCar\b)(?=.*\bBat\b)(?!.*(?:,|^)(?:(?!\b(?:Apple|Bat|Car)\b).)+(?:,|$))^.*$

Regex demo.
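
The word-boundary version can be tried in Python as in the sketch below. The pattern is copied as-is (split over two string literals for readability); the wrapper name `only_listed_words` is just for illustration.

```python
import re

ONLY_LISTED_WORDS = re.compile(
    # Every listed word must occur somewhere in the line...
    r"(?=.*\bApple\b)(?=.*\bCar\b)(?=.*\bBat\b)"
    # ...and no comma-delimited segment may be free of listed words.
    r"(?!.*(?:,|^)(?:(?!\b(?:Apple|Bat|Car)\b).)+(?:,|$))^.*$"
)


def only_listed_words(text):
    return ONLY_LISTED_WORDS.search(text) is not None


for line in ["Bat, Car, Apple", "Apple, Car, Bat, Stick", "Apple, Cartography, Bat", "Bat, Car"]:
    print(line, "->", only_listed_words(line))
```

As far as I can tell, this pattern checks that only listed words appear and that all of them appear, but it does not reject repeats: a line such as "Apple, Car, Bat, Bat" still matches.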

Andrej Kesely