Regexp for string which doesn't start with a word, doesn't contain another word, but does end with the third one

Question

I must elaborate a regexp that should match string with the following restrictions:

The string must not start with foo
The string must not contain /foo
The string must end with bar

I came up with the following pattern, but I am pretty sure there are more elegant and/or efficient solutions:

String match = "quxfoobar";
String notMatch = "qux/foobar";
String notMatch2 = "fooquxbar";
String pattern = "(?!foo)(?!.+/foo).*bar";
boolean m = match.matches(pattern);

Thanks for your inputs.

NB : Please note that I am using Java with the String.matches() method to match my pattern against my candidates strings.

I don't see anything glaringly wrong with your current one. If you want to debug, you can check out [this question](http://stackoverflow.com/questions/2348694/how-do-you-debug-a-regex). — Brian, Feb 06 '13 at 17:14
Please, use a **constant** of `Pattern` for avoid recompile the regular expression in every call of `matches`. — Paul Vargas, Feb 06 '13 at 17:18
Like @PaulVargas if you're concerned about efficiency, compile your pattern once into a `Pattern` and put it in a `static` variable, then do `pattern.matcher(match).matches` — Brian, Feb 06 '13 at 17:19
The regepx is called inside a for loop. I declared the String pattern just above the for loop as follows : `final String p = "...";`. I guess that means there is a regexp compilation at each iteration in the loop while using `final Pattern p = Pattern.compile("...");` would avoid recompilation. As for the `static` modifier, I am not sure I need it for my use case. — zoom, Feb 06 '13 at 21:54

score 3 · Accepted Answer · answered Feb 06 '13 at 17:58

3

Why regex? For fixed string, there are already built-in functions, which should be much faster than regex approach.

if (!str.startsWith("foo") && str.endsWith("bar") && !str.contains("/foo")) {
    // Do your stuff here
}

answered Feb 06 '13 at 17:58

nhahtdh

55,989
15
126
162

+1, regex isn't always the best solution for text matching when you can basically just do linear searching like this. – Brian Feb 06 '13 at 18:05
I was hesitating about using this approach at first glance but I should not. It is obviously more efficient and way clearer (enhancing code maintainability). – zoom Feb 06 '13 at 21:42

sp00m · Answer 2 · 2013-02-06T17:35:26.840

0

This one could suit your needs too:

^(?!foo)(/(?!foo)|[^/])*bar$

^(?!foo): the start of the string must not be followed by foo
(/(?!foo)|[^/])*: either a / that isn't followed by foo, or any char but a /, n times
bar$: the string must end with bar

Demo

But after some tests, yours is quicker than mine. Since I can't see any other relevant solution based on a regexp, I guess yours is finally the most elegant and/or efficient one ;)

edited Feb 06 '13 at 17:35

answered Feb 06 '13 at 17:13

sp00m

47,968
31
142
252

Does this perform better than his current one, though? Can you elaborate on this? His solution works, AFAIK. – Brian Feb 06 '13 at 17:17
1

@sp00m: Try to swap `[^/]` to front. The ordering affects which one the regex engine chooses to explore first. – nhahtdh Feb 06 '13 at 17:54

Stephen Ostermiller · Answer 3 · 2013-02-06T17:49:12.500

0

Regular expressions perform less well the longer and more complicated that they get. For something like this, it should be better performing and easier to understand to match the input against three regular expressions.

String match = "quxfoobar";
Pattern start = Pattern.compile("^foo");
Pattern contain = Pattern.compile("\\/foo");
Pattern end = Pattern.compile("bar$");
boolean m = (
   !start.matcher(match).find() &&
   !contain.matcher(match).find() &&
   end.matcher(match).find()
);

Edit: since there is some question as to whether three regex would be faster in this case, I wrote a benchmark. When I run it, the single regular expression (taken from another answer) is three times as slow as running with three separate regular expressions.

import java.util.regex.*;
import java.util.*;

public class Test {

    private static final Pattern START = Pattern.compile("^foo");
    private static final Pattern CONTAIN = Pattern.compile("\\/foo");
    private static final Pattern END = Pattern.compile("bar$");

    private static final Pattern ONEPATTERN = Pattern.compile("^(?!foo)(\\/(?!foo)|[^\\/])*bar$");



    public static void main(String[] args){
        String[] in = createInput();
        timeOnePattern(in);
        timeThreePatterns(in);
        System.exit(0);
    }

    public static String[] createInput(){
        String[] words = {"foo","bar","baz","biz","/foo"};
        Random rand = new Random();
        String[] in = new String[10000];
        for (int i=0; i<in.length; i++){
            StringBuilder sb = new StringBuilder();
            for (int j=0; j<4; j++){
                sb.append(words[rand.nextInt(words.length)]);
            }
            in[i] = sb.toString();
        }
        return in;
    }

    public static void timeThreePatterns(String[] in){
        long startTime = System.nanoTime();
        for (String s: in){
            boolean b = (!START.matcher(s).find() && !CONTAIN.matcher(s).find() && END.matcher(s).find());
        }
        long endTime = System.nanoTime();
        System.out.println("Three regular expressionv took " + (endTime - startTime) + " nanoseconds.");
    }

    public static void timeOnePattern(String[] in){
        long startTime = System.nanoTime();
        for (String s: in){
            ONEPATTERN.matcher(s).matches();
        }
        long endTime = System.nanoTime();
        System.out.println("Single regular expression took " + (endTime - startTime) + " nanoseconds.");
    }
}

edited Feb 06 '13 at 17:49

answered Feb 06 '13 at 17:14

Stephen Ostermiller

23,933
14
88
109

-1 unless you can provide a citation for your first sentence. They're implemented using state machines which have a finite running time. Running multiple expressions like this may actually take longer because you have to compile 3 different expressions. – Brian Feb 06 '13 at 17:16
Java regex is implement using the same techniques that Perl uses, NOT using state machines. Here is a good artcle about Perl regex vs NFA: http://swtch.com/~rsc/regexp/regexp1.html – Stephen Ostermiller Feb 06 '13 at 17:19
Fair enough, well-written regular expressions is what I should have said. In any case, do the look-aheads get compiled to an NFA? I believe they do, at which point having multiple expressions is unnecessary. – Brian Feb 06 '13 at 17:23
I added code to my answer that benchmarks three vs one. Three is indeed faster. – Stephen Ostermiller Feb 06 '13 at 17:49
Use the expression from the OP's question. As the other answer states in the comments: "After some tests, his is better than mine..." – Brian Feb 06 '13 at 17:51
The OP's regex is indeed faster than the one offered by spOOm. It still is only half as fast as using three regular expressions. Changing the single regular expression from .matches() to .find() make is ten times worse, so its using the best case scenario. The three regexes need find() instead of matches or they won't work at all. – Stephen Ostermiller Feb 06 '13 at 17:55
Take your test, add an extra 0 or two to the size of `String[] in`, change the inner `for` loop in `createInput` to generate a random number of words instead of always doing four, and lastly, switch the order of invocation for the tests (test your 3-expression first, and the 1-expression second). Your test isn't valid since it depends on certain properties of invocation order and hotspot compiling in the hotspot JVM. Granted, doing this, his still ran 10% slower, but that's only 10%. – Brian Feb 06 '13 at 18:02
You are correct that whichever runs second seems to run slightly faster, but not enough to make a single regular expression faster than three of them. Neither increasing the amount of input or randomly varying its length seems to have any effect. – Stephen Ostermiller Feb 06 '13 at 18:36
It should also be noted that we're talking about nanoseconds here, so we're really just splitting hairs. Besides, nhahtdh's solution is 10 times faster even than yours, so the whole argument is moot. I'll remove my downvote, but my statement about your first sentence still stands. It's a sweeping generalization, and well-written regular expressions get compiled into NFA (state machines), so it depends on the expression(s) itself, not on whether or not it's broken up into pieces. – Brian Feb 06 '13 at 19:04

Regexp for string which doesn't start with a word, doesn't contain another word, but does end with the third one

3 Answers3