1

I must elaborate a regexp that should match string with the following restrictions:

  • The string must not start with foo
  • The string must not contain /foo
  • The string must end with bar

I came up with the following pattern, but I am pretty sure there are more elegant and/or efficient solutions:

String match = "quxfoobar";
String notMatch = "qux/foobar";
String notMatch2 = "fooquxbar";
String pattern = "(?!foo)(?!.+/foo).*bar";
boolean m = match.matches(pattern);

Thanks for your inputs.

NB : Please note that I am using Java with the String.matches() method to match my pattern against my candidates strings.

Alex Shesterov
  • 26,085
  • 12
  • 82
  • 103
zoom
  • 1,686
  • 2
  • 16
  • 27
  • I don't see anything glaringly wrong with your current one. If you want to debug, you can check out [this question](http://stackoverflow.com/questions/2348694/how-do-you-debug-a-regex). – Brian Feb 06 '13 at 17:14
  • 1
    Please, use a **constant** of `Pattern` for avoid recompile the regular expression in every call of `matches`. – Paul Vargas Feb 06 '13 at 17:18
  • 1
    Like @PaulVargas if you're concerned about efficiency, compile your pattern once into a `Pattern` and put it in a `static` variable, then do `pattern.matcher(match).matches` – Brian Feb 06 '13 at 17:19
  • The regepx is called inside a for loop. I declared the String pattern just above the for loop as follows : `final String p = "...";`. I guess that means there is a regexp compilation at each iteration in the loop while using `final Pattern p = Pattern.compile("...");` would avoid recompilation. As for the `static` modifier, I am not sure I need it for my use case. – zoom Feb 06 '13 at 21:54

3 Answers3

3

Why regex? For fixed string, there are already built-in functions, which should be much faster than regex approach.

if (!str.startsWith("foo") && str.endsWith("bar") && !str.contains("/foo")) {
    // Do your stuff here
}
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
  • +1, regex isn't always the best solution for text matching when you can basically just do linear searching like this. – Brian Feb 06 '13 at 18:05
  • I was hesitating about using this approach at first glance but I should not. It is obviously more efficient and way clearer (enhancing code maintainability). – zoom Feb 06 '13 at 21:42
0

This one could suit your needs too:

^(?!foo)(/(?!foo)|[^/])*bar$
  • ^(?!foo): the start of the string must not be followed by foo
  • (/(?!foo)|[^/])*: either a / that isn't followed by foo, or any char but a /, n times
  • bar$: the string must end with bar

Demo

But after some tests, yours is quicker than mine. Since I can't see any other relevant solution based on a regexp, I guess yours is finally the most elegant and/or efficient one ;)

sp00m
  • 47,968
  • 31
  • 142
  • 252
  • Does this perform better than his current one, though? Can you elaborate on this? His solution works, AFAIK. – Brian Feb 06 '13 at 17:17
  • 1
    @sp00m: Try to swap `[^/]` to front. The ordering affects which one the regex engine chooses to explore first. – nhahtdh Feb 06 '13 at 17:54
0

Regular expressions perform less well the longer and more complicated that they get. For something like this, it should be better performing and easier to understand to match the input against three regular expressions.

String match = "quxfoobar";
Pattern start = Pattern.compile("^foo");
Pattern contain = Pattern.compile("\\/foo");
Pattern end = Pattern.compile("bar$");
boolean m = (
   !start.matcher(match).find() &&
   !contain.matcher(match).find() &&
   end.matcher(match).find()
);

Edit: since there is some question as to whether three regex would be faster in this case, I wrote a benchmark. When I run it, the single regular expression (taken from another answer) is three times as slow as running with three separate regular expressions.

import java.util.regex.*;
import java.util.*;

public class Test {

    private static final Pattern START = Pattern.compile("^foo");
    private static final Pattern CONTAIN = Pattern.compile("\\/foo");
    private static final Pattern END = Pattern.compile("bar$");

    private static final Pattern ONEPATTERN = Pattern.compile("^(?!foo)(\\/(?!foo)|[^\\/])*bar$");



    public static void main(String[] args){
        String[] in = createInput();
        timeOnePattern(in);
        timeThreePatterns(in);
        System.exit(0);
    }

    public static String[] createInput(){
        String[] words = {"foo","bar","baz","biz","/foo"};
        Random rand = new Random();
        String[] in = new String[10000];
        for (int i=0; i<in.length; i++){
            StringBuilder sb = new StringBuilder();
            for (int j=0; j<4; j++){
                sb.append(words[rand.nextInt(words.length)]);
            }
            in[i] = sb.toString();
        }
        return in;
    }

    public static void timeThreePatterns(String[] in){
        long startTime = System.nanoTime();
        for (String s: in){
            boolean b = (!START.matcher(s).find() && !CONTAIN.matcher(s).find() && END.matcher(s).find());
        }
        long endTime = System.nanoTime();
        System.out.println("Three regular expressionv took " + (endTime - startTime) + " nanoseconds.");
    }

    public static void timeOnePattern(String[] in){
        long startTime = System.nanoTime();
        for (String s: in){
            ONEPATTERN.matcher(s).matches();
        }
        long endTime = System.nanoTime();
        System.out.println("Single regular expression took " + (endTime - startTime) + " nanoseconds.");
    }
}
Stephen Ostermiller
  • 23,933
  • 14
  • 88
  • 109
  • -1 unless you can provide a citation for your first sentence. They're implemented using state machines which have a finite running time. Running multiple expressions like this may actually take longer because you have to compile 3 different expressions. – Brian Feb 06 '13 at 17:16
  • Java regex is implement using the same techniques that Perl uses, NOT using state machines. Here is a good artcle about Perl regex vs NFA: http://swtch.com/~rsc/regexp/regexp1.html – Stephen Ostermiller Feb 06 '13 at 17:19
  • Fair enough, well-written regular expressions is what I should have said. In any case, do the look-aheads get compiled to an NFA? I believe they do, at which point having multiple expressions is unnecessary. – Brian Feb 06 '13 at 17:23
  • I added code to my answer that benchmarks three vs one. Three is indeed faster. – Stephen Ostermiller Feb 06 '13 at 17:49
  • Use the expression from the OP's question. As the other answer states in the comments: "After some tests, his is better than mine..." – Brian Feb 06 '13 at 17:51
  • The OP's regex is indeed faster than the one offered by spOOm. It still is only half as fast as using three regular expressions. Changing the single regular expression from .matches() to .find() make is ten times worse, so its using the best case scenario. The three regexes need find() instead of matches or they won't work at all. – Stephen Ostermiller Feb 06 '13 at 17:55
  • Take your test, add an extra 0 or two to the size of `String[] in`, change the inner `for` loop in `createInput` to generate a random number of words instead of always doing four, and lastly, switch the order of invocation for the tests (test your 3-expression first, and the 1-expression second). Your test isn't valid since it depends on certain properties of invocation order and hotspot compiling in the hotspot JVM. Granted, doing this, his still ran 10% slower, but that's only 10%. – Brian Feb 06 '13 at 18:02
  • You are correct that whichever runs second seems to run slightly faster, but not enough to make a single regular expression faster than three of them. Neither increasing the amount of input or randomly varying its length seems to have any effect. – Stephen Ostermiller Feb 06 '13 at 18:36
  • It should also be noted that we're talking about nanoseconds here, so we're really just splitting hairs. Besides, nhahtdh's solution is 10 times faster even than yours, so the whole argument is moot. I'll remove my downvote, but my statement about your first sentence still stands. It's a sweeping generalization, and well-written regular expressions get compiled into NFA (state machines), so it depends on the expression(s) itself, not on whether or not it's broken up into pieces. – Brian Feb 06 '13 at 19:04