
I'm trying to detect listings within a text provided by the user, but my regular expression fails to detect them properly.

Example Text

a, b, c and d, or e

Rule Set

\w+(,?\s*\w+)+,?\s*(and|or)

Starting with a single word on the left side suffices for my use case (denoted by the first \w+). Testing the regular expression with Regular Expressions 101 shows that it works fine with the example text above.

Using Java's Matcher class, I can simply check for the last group whether it is an and or or, to detect the "type" of the conjunction (so to speak).
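A minimal sketch of that check, assuming a hypothetical nested enum Kind and a hypothetical helper firstConjunction (both names exist only for this illustration). It reads the last capturing group and maps it to the enum with valueOf:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConjunctionTypeDemo {
    // Hypothetical enum for this sketch only.
    enum Kind { AND, OR }

    private static final Pattern LISTING =
            Pattern.compile("\\w+(,?\\s*\\w+)+,?\\s*(and|or)");

    // Returns the conjunction type of the first listing found, or null if none is found.
    static Kind firstConjunction(String text) {
        Matcher matcher = LISTING.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        // The pattern has two capturing groups, so groupCount() is 2
        // and this reads the trailing (and|or) group.
        String word = matcher.group(matcher.groupCount());
        return Kind.valueOf(word.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(firstConjunction("a, b, c and d, or e")); // prints OR
    }
}
```

Since the group can only ever contain and or or, valueOf cannot throw here; a pattern with more alternatives would need a matching enum constant for each of them.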

However, a more complex input will cause a false detection of the listings. That is, multiple listings are detected as one rather than multiple.

Multiple Listings Example

a, b, c and d, or e but not f, g, h and i, or j

Again, testing with Regular Expressions 101 shows that only one listing is detected (reaching from the start of the text to the very last or).

So, how would I alter the regular expression to detect each listing separately rather than all of them as one?

I'm also fine with any other solution; I would just like to solve this as cleanly as possible.


Finally, here is an example implementation.

Main

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("\\w+(,?\\s*\\w+)+,?\\s*(and|or)").matcher("a, b, c and d, or e but not f, g, h and i, or j");

        while(matcher.find()){
            String conjunctionType = matcher.group(matcher.groupCount()).toLowerCase();

            Arrays.asList(Conjunction.values()).forEach(type -> {
                if(conjunctionType.equals(type.toString())){
                    System.out.println("Type: " + type);
                    System.out.println("Match: " + matcher.group());
                    // TODO: use the type for further processing
                }
            });
        }
    }
}

Conjunction Enum

public enum Conjunction {
    AND,
    OR;

    @Override
    public String toString(){
        return this.name().toLowerCase();
    }
}

Output

Type: or
Match: a, b, c and d, or e but not f, g, h and i, or

Desired Output

Type: or
Match: a, b, c and d, or
Type: or
Match: f, g, h and i, or

Update

I forgot to mention that each single letter in the example texts above is a mere placeholder for an arbitrary number of words.

An Even More Complex Example

a, b with some other words, c and d , or e but not f, g, h or i, and j
mcuenez

2 Answers

1

The \w+ cannot distinguish a list item such as a from filler words such as but or not. It seems that you have to make the comma a mandatory delimiter, except where and is used, and define the and delimiter explicitly:

\w+(?:,\s*\w+(?:\s+and\s+\w+)?)+,?\s*(and|or)

Demo: https://regex101.com/r/NqlBLk/1
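As a rough sketch of how this pattern behaves in Java, the snippet below runs it against the multiple-listings example from the question. The class name ListingFinder and the helper findListings are hypothetical names chosen for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingFinder {
    // Comma is mandatory between items; "x and y" is allowed inside an item.
    private static final Pattern LISTING = Pattern.compile(
            "\\w+(?:,\\s*\\w+(?:\\s+and\\s+\\w+)?)+,?\\s*(and|or)");

    // Returns one "type: match" string per listing found in the text.
    static List<String> findListings(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = LISTING.matcher(text);
        while (matcher.find()) {
            // The inner groups are non-capturing, so group(1) is the closing conjunction.
            result.add(matcher.group(1) + ": " + matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "a, b, c and d, or e but not f, g, h and i, or j";
        // Prints:
        // or: a, b, c and d, or
        // or: f, g, h and i, or
        findListings(text).forEach(System.out::println);
    }
}
```

Because the words e, but and not are not followed by a comma, no match can start at them, which is what splits the text into two listings.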

Dmitry Egorov
  • +1 for using `?:`, I didn't think of that. It seems I forgot to mention some aspects, sorry - I'll update the question. – mcuenez Apr 17 '17 at 14:24
0

I finally found a solution by making the regular expression partially non-greedy.

(\b\w+\b\s*,??\s*)+, (or|and)

Note the ?? in the regular expression: it is the lazy (reluctant) form of ?, so the comma is consumed inside the repeated group only when the rest of the pattern cannot match otherwise. While this ignores the last "item" of each listing, it is sufficient for my use case.
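To illustrate the difference the lazy quantifier makes, here is a small comparison sketch. The class LazyCommaDemo, the helper matches, and the greedy variant of the pattern are assumptions added for this illustration; only the lazy pattern is the one used in this answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyCommaDemo {
    // Same pattern twice; only the comma quantifier differs (",?" vs ",??").
    static final String GREEDY = "(\\b\\w+\\b\\s*,?\\s*)+, (or|and)";
    static final String LAZY = "(\\b\\w+\\b\\s*,??\\s*)+, (or|and)";

    // Collects every full match of the regex in the text.
    static List<String> matches(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "a, b, c and d, or e but not f, g, h and i, or j";
        // Greedy: the group swallows "d, " and runs on to the last ", or",
        // so the whole text is reported as a single listing.
        System.out.println(matches(GREEDY, text));
        // Lazy: the group stops at "d", so ", or" can match right away,
        // which yields two separate listings.
        System.out.println(matches(LAZY, text));
    }
}
```

With the greedy variant the result is one match ending at the last or; with the lazy variant it is the two matches shown in the output of this answer.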

Example Code

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String text = "a, b, c and d, or e but not f, g, h and i, or j";
        String pattern = "(\\b\\w+\\b\\s*,??\\s*)+, (or|and)";      

        Matcher matcher = Pattern.compile(pattern).matcher(text);

        while(matcher.find()){
            String conjunctionType = matcher.group(matcher.groupCount()).toLowerCase();

            Arrays.asList(Conjunction.values()).forEach(type -> {
                if(conjunctionType.equals(type.toString())){
                    System.out.println("Type: " + type);
                    System.out.println("Match: " + matcher.group());
                    // TODO: use the type for further processing
                }
            });
        }
    }
}

Output

Type: or
Match: a, b, c and d, or
Type: or
Match: e but not f, g, h and i, or
mcuenez