I'm currently trying to detect listings within a text given by the user, but I fail to detect them properly with a regular expression.
Example Text
a, b, c and d, or e
Rule Set
\w+(,?\s*\w+)+,?\s*(and|or)
Starting with one word on the left side suffices for my use case (denoted by the first \w+). Testing the regular expression with Regular Expressions 101 shows that it works just fine with the example text above.
Using Java's Matcher class, I can simply check whether the last group is an and or an or, to detect the "type" of the conjunction (so to speak).
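To illustrate that check, here is a minimal sketch run against the example text above. The class name ListingDemo and the helper firstListing are placeholders made up for this sketch only; the pattern is the one from the rule set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingDemo {
    // Same pattern as in the rule set; the last group (group 2) holds the conjunction.
    private static final Pattern LISTING =
            Pattern.compile("\\w+(,?\\s*\\w+)+,?\\s*(and|or)");

    // Hypothetical helper: returns {matched text, conjunction} for the first
    // listing found in the text, or null if nothing matches.
    static String[] firstListing(String text) {
        Matcher matcher = LISTING.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        // groupCount() is the index of the last capturing group, i.e. (and|or).
        return new String[] { matcher.group(), matcher.group(matcher.groupCount()) };
    }

    public static void main(String[] args) {
        String[] result = firstListing("a, b, c and d, or e");
        System.out.println("Match: " + result[0]);
        System.out.println("Type: " + result[1]);
    }
}
```

For the simple example text this prints "Match: a, b, c and d, or" followed by "Type: or", because the greedy repetition backtracks to the last and/or in the input.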
However, more complex input is detected incorrectly: multiple listings are matched as a single listing rather than separately.
Multiple Listings Example
a, b, c and d, or e but not f, g, h and i, or j
Again, when testing with Regular Expressions 101, only one listing is detected (reaching from the start of the text to the very last or).
So, how can I alter the regular expression to detect each listing separately rather than all listings as one?
I'm fine with any other solution, too. I would just like to solve this as cleanly as possible.
Finally, here is some code showing an example implementation.
Main
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("\\w+(,?\\s*\\w+)+,?\\s*(and|or)")
                .matcher("a, b, c and d, or e but not f, g, h and i, or j");
        while (matcher.find()) {
            String conjunctionType = matcher.group(matcher.groupCount()).toLowerCase();
            Arrays.asList(Conjunction.values()).forEach(type -> {
                if (conjunctionType.equals(type.toString())) {
                    System.out.println("Type: " + type);
                    System.out.println("Match: " + matcher.group());
                    // TODO: use the type for further processing
                }
            });
        }
    }
}
Conjunction Enum
public enum Conjunction {
    AND,
    OR;

    @Override
    public String toString() {
        return this.name().toLowerCase();
    }
}
Output
Type: or
Match: a, b, c and d, or e but not f, g, h and i, or
Desired Output
Type: or
Match: a, b, c and d, or
Type: or
Match: f, g, h and i, or
Update
I forgot to mention that each single letter in the examples above is merely a placeholder for an arbitrary number of words.
An Even More Complex Example
a, b with some other words, c and d , or e but not f, g, h or i, and j