3

I am trying to build a 'simple' regular expression (in java) that matches sentences like these:

I want to cook something
I want to cook something with chicken and cheese
I want to cook something with chicken but without onions
I want to cook something without onions but with chicken and cheese
I want to cook something with candy but without nuts within 30 minutes

In best case it should also match: I want to cook something with candy and without nuts within 30 minutes

In those examples I want to capture 'included' ingredients, 'excluded' ingredients and the max 'duration' for the cooking procedure. As you can see each of those 3 capturing groups is optional in the pattern, each is starting with a specific word (with, (but )?without, within) and the groups should match using wildcards UNTIL the next of those specifics keywords is found. Additionally those ingredients can contain several words, so in the second/third example "chicken and cheese" should be matched to the named capturing group 'included'.

In best case I would like to write a pattern similar to this one:

I want to cook something ((with (?<include>.+))|((but )?without (?<exclude>.+))|(within (?<duration>.+) minutes))*

Apparently this does not work because those wildcards can also match to the keywords so after the first keyword got matched everything else (including further keywords) will be matched by the greedy wildcard of the corresponding named capturing group.

I tried working with lookahead, for example something like this:

something ((with (?<IncludedIngredients>.*(?=but)))|(but )?without (?<ExcludedIngredients>.+))+

That regex recognizes something with chicken but without onions but does not match to something with chicken.

Is there a simple solution to do this in regular expressions?

P.S. 'Simple' solution means that I don't have to specify all possible combinations of those keywords in a sentence and order them by the amount of keywords being used in each combination.

DataWorm
  • 73
  • 5
  • 2
    Hard to tell what's meant. No such thing as greedy until the next keyword in regex. `abc.*(?=abc)` does not work. Something that would work is `(?:abc(?:(?!abc).)*` –  Nov 13 '19 at 22:24
  • 1
    What I would suggest is ensuring that the line begins with `I want to cook something` and removing it from the line, then splitting the remaining string on your keywords (or regexes). Something like `(?=with(?:in)?|(?:but )?without)`. Then extracting what you need. P.S. What @x15 suggested is the best way to go about it if you're trying to employ a mostly regex approach. The pattern that was created at the end of the comment uses what's called a [tempered greedy token](https://stackoverflow.com/a/37343088/3600709) – ctwheels Nov 13 '19 at 22:40
  • @x15 the opposite of greedy is reluctant. When you want the match to stop at the next keyword, you just don’t want greedy. Using a reluctant quantifier is much simpler than combining greedy with negative look-ahead. – Holger Nov 14 '19 at 12:57

2 Answers2

2

It can probably be boiled down to the construct below.

(?m)^I[ ]want[ ]to[ ]cook[ ]something(?=[ ]|$)(?<Order>(?:(?<with>\b(?:but[ ])?with[ ](?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)*)|(?<without>\b(?:but[ ])?without[ ](?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)*)|(?<time>\bwithin[ ](?<duration>.+)[ ]minutes[ ]?)|(?<unknown>(?:(?!(?:\b(?:but[ ])?with(?:in|out)?\b)).)+))*)$

https://regex101.com/r/RHfGnb/1

Expanded

 (?m)
 ^ I [ ] want [ ] to [ ] cook [ ] something
 (?= [ ] | $ )
 (?<Order>                      # (1 start)
      (?:
           (?<with>                      # (2 start)
                \b
                (?: but [ ] )?
                with [ ]
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )*
           )                             # (2 end)
        |  (?<without>                   # (3 start)
                \b
                (?: but [ ] )?
                without [ ]
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )*
           )                             # (3 end)
        |  (?<time>                      # (4 start)
                \b within [ ]
                (?<duration> .+ )             # (5)
                [ ] minutes [ ]? 
           )                             # (4 end)
        |  (?<unknown>                   # (6 start)
                (?:
                     (?!
                          (?:
                               \b
                               (?: but [ ] )?
                               with
                               (?: in | out )?
                               \b
                          )
                     )
                     .
                )+
           )                             # (6 end)
      )*
 )                             # (1 end)
 $
2

Your pattern is not bad. Once you identified the greedy nature of the quantifiers as the problem, just consider changing them to reluctant, i.e. replace .+ with .+?:

String[] examples = {
    "I want to cook something",
    "I want to cook something with chicken and cheese",
    "I want to cook something with chicken but without onions",
    "I want to cook something without onions but with chicken and cheese",
    "I want to cook something with candy but without nuts within 30 minutes" };

Pattern p = Pattern.compile("I want to cook something"
    + "((( but)? with (?<include>.+?))|(( but)? without (?<exclude>.+?))"
        + "|( within (?<duration>.+?) minutes))*");

for(String s: examples) {
    Matcher m = p.matcher(s);
    if(m.matches()) {
        System.out.println(s);
        if(m.start("include") >= 0) System.out.println("\tinclude: "+m.group("include"));
        if(m.start("exclude") >= 0) System.out.println("\texclude: "+m.group("exclude"));
        if(m.start("duration") >= 0) System.out.println("\tduration: "+m.group("duration"));
    }
}
I want to cook something
I want to cook something with chicken and cheese
    include: chicken and cheese
I want to cook something with chicken but without onions
    include: chicken
    exclude: onions
I want to cook something without onions but with chicken and cheese
    include: chicken and cheese
    exclude: onions
I want to cook something with candy but without nuts within 30 minutes
    include: candy
    exclude: nuts
    duration: 30

The only other things to change were to add an optional but to with, to allow without … but with, and the placement of the spaces, to match "I want to cook something" without a trailing space.

Holger
  • 285,553
  • 42
  • 434
  • 765
  • Not sure how this is matching the extra stuff there. Is it forced to match the whole line ? https://regex101.com/r/SSAROF/1. –  Nov 14 '19 at 21:27
  • @x15 there seems to be a difference between PHP and Java regarding the greedy quantifier at group1. It seems that in PHP, this combination of inner reluctant and outer greedy quantifier disables backtracking whereas in java, it does not. I only tested in Java. Unfortunately, the page you’ve linked doesn’t offer a “Java flavor”. – Holger Nov 15 '19 at 07:36
  • A test is to make the first string `I want to cook something ABC` If it doesn't match a partial, then implicit `^$` anchors surround the regex. And that's likely the reason. Consider that inside the alternation group are 3 literals followed by a non-greedy clause `.+?` that is required and will only match 1 character at a time before it checks the group again for another match. Since the groups alternations each contain literals that don't match the next character, the whole match stops at that point with everything satisfied. Basically, every regex engine works this way. –  Nov 15 '19 at 17:06
  • More tests https://regex101.com/r/09UPYv/1, https://regex101.com/r/mwXNUF/1, https://regex101.com/r/OWCtlI/1, https://regex101.com/r/FyGwDw/1 –  Nov 15 '19 at 17:11
  • @x15 the Java code of my answer uses `matches` rather then `find`, which is indeed implying that only matching the entire input is a valid result. And indeed, it works the same way: https://regex101.com/r/ax84Kn/1 So you are right, that makes the difference… – Holger Nov 20 '19 at 13:13