regex VS Contains. Best Performance?

Question

I want to compare an URI String over different patterns in java and I want fastest code possible.

Should I use :

if(uri.contains("/br/fab") || uri.contains("/br/err") || uri.contains("/br/sts")

Or something like :

if(uri.matches(".*/br/(fab|err|sts).*"))

Note that I can have a lot more uri and this method is called very often.

What is the best answer between my choices ?

In the immortal words of Harry Hill: "There's only one way to find out..." — Dan Dyer, Jan 07 '10 at 21:28
why don't you make some iterations so you can measure it by yourself... explanations will then make it richer. — Oso, Jan 07 '10 at 21:28
Better try a non-greedy `[^/]*(?:/[^/]*)*?/br/(fab|err|sts).*`. — Gumbo, Jan 07 '10 at 21:31
For the first option, you may want to study your data so that you know the expected frequency of 'fab', 'err', and 'sts'. In this way, you can order the expressions in the if condition accordingly. This _might_ help constructing with the regex as well, though I doubt it. ps. I strongly doubt if this is a bottleneck. — Michael Easter, Jan 07 '10 at 21:48
It's faster to use contains because regex will perform very poorly in this case. This particular expression pokes at a weakness in regex. — PSpeed, Jan 07 '10 at 22:10
@PSpeed: No, this example pokes at a weakness in Java's `String#matches()` method--two of them, in fact: it recompiles the regex every time, and it has to match the whole target string. Jon Skeet's `find()` suggestion eliminates both problems. — Alan Moore, Jan 08 '10 at 13:21
Even if you switch this to compiled Patterns the regex will be slower than the contains()... and will only get slower the more options you add. And find() doesn't fix that particular problem because of the | in the regex. Regex will start at pos 0, check all of the options, move to pos 1, check all of the options, and so on. The cases where this will perform faster are less than the cases where simple scanning will perform faster... since a scan can be more easily optimized. At any rate, for this example once the Pattern has been compiled the different should be slim. — PSpeed, Jan 08 '10 at 19:04
...though note that the prefix "/bar/" does help a lot in this case. Further narrows the performance gap between the two approaches. — PSpeed, Jan 08 '10 at 19:05

score 29 · Answer 1 · answered Jan 07 '10 at 21:30

If you're going to use a regular expression, create it up-front and reuse the same Pattern object:

private static final Pattern pattern = Pattern.compile(".*/br/(fab|err|sts).*");

Do you actually need the ".*" at each end? I wouldn't expect it to be required, if you use Matcher.find().

Which is faster? The easiest way to find out is to measure it against some sample data - with as realistic samples as possible. (The fastest solution may very well depend on

Are you already sure this is a bottleneck though? If you've already measured the code enough to find out that it's a bottleneck, I'm surprised you haven't just tried both already. If you haven't verified that it's a problem, that's the first thing to do before worrying about the "fastest code possible".

If it's not a bottleneck, I would personally opt for the non-regex version unless you're a regex junkie. Regular expressions are very powerful, but also very easy to get wrong.

Let's say it's a theory question. Optimization can be interesting even if it's not a bottleneck. — Robert Fraser, Jan 07 '10 at 21:35
@Robert: It can be, but this is likely to depend on the actual data. You could come up with a test showing one approach to be faster - and then with the OP's real data you could get the other answer. I think it's actually more important to learn the lesson that you should check for bottlenecks before you assume you need the "fastest code possible". — Jon Skeet, Jan 07 '10 at 22:04

score 16 · Answer 2 · edited Nov 16 '16 at 10:40

16

I've done a test and it is faster to use contains. As Ewan Todd said, they both fast enough to don't really bother with that.

edited Nov 16 '16 at 10:40

answered Jan 07 '10 at 21:49

Mike

2,354
3
23
37

4

if you have to it it triillions of times, it matters :-) – Boris Aug 02 '19 at 22:39

score 14 · Answer 3 · answered Nov 16 '18 at 07:57

Both are fast enough, but contains is faster. Facts: ~20mil ops vs ~1mil ops

Using the following jmh code to test

@State(Scope.Benchmark)
public class Main {

  private String uri = "https://google.com/asdfasdf/ptyojty/aeryethtr";

  @Benchmark
  @Warmup(iterations = 5)
  @Measurement(iterations = 5)
  @Fork(value = 1, warmups = 0)
  public void initContains() throws InterruptedException {
    if (uri.contains("/br/fab") || uri.contains("/br/err") || uri.contains("/br/sts")) {}
  }

  @Benchmark
  @Warmup(iterations = 5)
  @Measurement(iterations = 5)
  @Fork(value = 1, warmups = 0)
  public void initMatches() throws InterruptedException {
    if (uri.matches(".*/br/(fab|err|sts).*")) {}
  }

  public static void main(String[] args) throws Exception {
    org.openjdk.jmh.Main.main(args);
  }
}

The results

# Run complete. Total time: 00:00:37

Benchmark           Mode  Cnt         Score         Error  Units
Main.initContains  thrpt    5  21004897.968 ± 1987176.746  ops/s
Main.initMatches   thrpt    5   1177562.581 ±  248488.092  ops/s

score 3 · Answer 4 · answered Jan 07 '10 at 21:28

I would expect contains() to be faster since it won't have to compile and iterate through a (relatively) complex regular expression, but rather simply look for a sequence of characters.

But (as with all optimisations) you should measure this. Your particular situation may impact results, to a greater or lesser degree.

Furthermore, is this known to be causing you grief (wrt. performance) ? If not, I wouldn't worry about it too much, and choose the most appropriate solution for your requirements regardless of performance issues. Premature optimisation will cause you an inordinate amount of grief if you let it!

Panthro · Answer 5 · 2017-01-29T18:59:38.967

UPDATE: I know this is not the best benchmark code and for each case there are several ways to optimize it.

What I wanted to achieve was, for a regular developer that will use the simpler ways of doing things and it's not a JVM expert, that's the "common" way to use it, so here it goes.

ORIGINAL:

The below code produced the following output

 contains took: 70 
 matches took: 113 
 matches with pre pattern took: 419

The test class

public class MatchesTester {

public static void main(String[] args) {

    String typeStr = "Nunc rhoncus odio ac tellus pulvinar, et volutpat sapien aliquet. Nam sed libero nec ex laoreet pretium sed id mi. Aliquam erat volutpat. Aenean at erat vitae massa iaculis mattis. Quisque sagittis massa orci, sit amet vestibulum turpis tempor a. Etiam eget venenatis arcu. Nunc enim augue, pulvinar at nulla ut, pellentesque porta sapien. Maecenas ut erat id nisi tincidunt faucibus eget vel erat. Morbi quis magna et massa pharetra venenatis ut a lacus. Vivamus egestas vitae nulla eget tristique. Praesent consectetur, tellus quis bibendum suscipit, nisl turpis mattis sapien, ultrices mollis leo quam eu eros.application/binaryNunc rhoncus odio ac tellus pulvinar, et volutpat sapien aliquet. Nam sed libero nec ex laoreet pretium sed id mi. Aliquam erat volutpat. Aenean at erat vitae massa iaculis mattis. Quisque sagittis massa orci, sit amet vestibulum turpis tempor a. Etiam eget venenatis arcu. Nunc enim augue, pulvinar at nulla ut, pellentesque porta sapien. Maecenas ut erat id nisi tincidunt faucibus eget vel erat. Morbi quis magna et massa pharetra venenatis ut a lacus. Vivamus egestas vitae nulla eget tristique. Praesent consectetur, tellus quis bibendum suscipit, nisl turpis mattis sapien, ultrices mollis leo quam eu eros.";

    int timesToTest = 10000;
    long start =  System.currentTimeMillis();
    int count = 0;
    //test contains
    while(count < timesToTest){
        if (typeStr.contains("image") || typeStr.contains("audio") || typeStr.contains("video") || typeStr.contains("application")) {
            //do something non expensive like creating a simple native var
            int a = 10;
        }
        count++;
    }
    long end = System.currentTimeMillis();
    System.out.println("contains took: "+ (end - start));

    long start2 =  System.currentTimeMillis();
    count = 0;
    while(count < timesToTest){
        if (typeStr.matches("(image|audio|video|application)")) {
            //do something non expensive like creating a simple native var
            int a = 10;
        }
        count++;
    }
    long end2 = System.currentTimeMillis(); //new var to have the same cost as contains
    System.out.println("matches took: "+ (end2 - start2));


    long start3 =  System.currentTimeMillis();
    count = 0;
    Pattern pattern = Pattern.compile("(image|audio|video|application)");
    while(count < timesToTest){
        if (pattern.matcher(typeStr).find()) {
            //do something non expensive like creating a simple native var
            int a = 10;
        }
        count++;
    }
    long end3 = System.currentTimeMillis(); //new var to have the same cost as contains
    System.out.println("matches with pre pattern took: "+ (end3 - start3));


}

This benchmark is not really accurate. You should use a library such as Caliper to do so. See: http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java — Mike, Nov 17 '14 at 18:48
at the very least, the Pattern.compile call should be moved out of the time calculations, since this would ordinarily be done at startup or instantiation, not every time- — chrismarx, Jun 12 '15 at 18:24
You are creating multiple Matcher object by calling matcher(typeStr) for each iteration. You will see performance improvement if instead of creating new matcher every time, you create it once and reset it every time you want to use it for finding sub-string matches. `Pattern pattern = Pattern.compile("image|audio|video|application"); Matcher matcher = pattern.matcher(""); long start3 = System.currentTimeMillis(); while (count < repetition) { if (matcher.reset(typeStr).find()) { //do something non expensive like creating a simple native var int a = 0; } count++; }` — sactiw, Jul 09 '15 at 12:39
@sactiw that optimization only works in single-threaded use-cases, and the performance gain is not that great (about 1.5x faster in a benchmark I made), so in general I think it's better to create a new Matcher for thread safety and code clarity — mitchnull, Apr 23 '21 at 06:40
@mitchnull how is local `Matcher matcher = pattern.matcher("");` thread unsafe. It will be local variable for each thread , right ? Moreover, the optimization depends on size and scale in this case. It may be negligible for simple pattern and fewer iterations but it might be significant if the size is huge. That said, IMO, it is functionally correct and simple enough to read so in general it is better to go for it. — sactiw, Nov 09 '22 at 15:02
@sactiw your suggestion only makes sense if `matcher` is a shared resource, not a function-local variable (the while loop there is obviously just a speed test, you won't do that in real code). So my comment is about making this `matcher` a shared (between invocations) resource. That would not be thread-safe, so your function using that `matcher` would need synchronization in multi-threaded code killing the minor performance gain you could achieve here (if your code were guaranteed to be single-threaded) — mitchnull, Nov 09 '22 at 18:36

score 0 · Answer 6 · answered Jan 07 '10 at 21:32

If the bit you are trying to match against is always at the beginning, or end, or is in some other way predictable then: neither!

For example, if urls are like http://example.com/br/fab or http://example.com/br/err all the time, then you could store "br/fab" and "br/err" etc in a HashSet or similar, and then given an incoming URL, chop off the last part of it and query the Set to see if it contains it. This will scale better than either method you gave (with a HashSet it should get no slower to lookup entries no matter how many there are).

If you do need to match against substrings appearing in arbitrary locations... it depends what you mean by "a lot more". One thing you should do regardless of the specifics of the problem is try things out and benchmark them!

If it's only a few entries, using a HashSet would be slower than just using "endsWith" three times... A hashset would scale well in terms of number of entries to match *against*, but there's no indication that there will be a large number of these. It won't scale any better than the other methods for large numbers of URIs to check. — Jon Skeet, Jan 07 '10 at 21:34
I was under the impression he meant that he might be matching against many more entries. Yes, if it's just lots of uris and few entries, any method will scale linearly. — ZoFreX, Jan 08 '10 at 01:45

Nit · Answer 7 · 2016-08-31T14:59:30.980

-1

its much faster if you use indexOf().

if(uri.indexOf("/br/fab")>-1 || uri.indexOf("/br/err")>-1 || uri.indexOf("/br/sts") >-1 )
{
       your code.
}

and problem with contains() is internally it creates a Matcher(java.util.regex.Matcher) object and evalates the expression.

Matcher is a very costly thing if processing large amount of data.

edited Aug 31 '16 at 14:59

answered Aug 31 '16 at 13:33

Nit

23
3

6

`problem with contains() is internally it creates a Matcher`. No, it doesn't. At least the Oracle JDK uses indexOf in contains. I doubt that any JDK is so bad that it uses Matcher in contains. – Vsevolod Golovanov Sep 07 '16 at 13:12

score -12 · Accepted Answer · answered Jan 07 '10 at 21:31

-12

They're both fast enough to be over before you know it. I'd go for the one that you can read more easily.

answered Jan 07 '10 at 21:31

Ewan Todd

7,315
26
33

10

I agree with you that in most of the cases it doesn't make a difference, but I'd like to know also which one is the fastest. I once worked in this application that had to do the same procedure for over 20kk entries (like files, or wikipedia pages), every millisecond we could gain in performance was key to success as the whole "scan" was taking up to 15 days. – Panthro Jul 08 '15 at 10:05
11

Doesn't answer the question the OP asked. Downvote. – searchengine27 Jul 30 '15 at 20:45
10

Very poor answer. You make performance tests over thousands or millions of iterations. – Juan Acevedo Dec 17 '15 at 22:58
2

Having fast enough for human eyes is not answering the OP'e question. Code is not always written for human read. The question was specific for performance – The Java Guy Feb 14 '18 at 03:58
2

Need to be more specific with valid proofs to prove your point. Downvote. – NixRam Jul 23 '18 at 10:01
This answer doesn't consider the situations where performance is consequential. In cases of data analytics, the 200ns in extra compute time will translate to days of additional compute. Compute power is not free and neither is time. – Iyad K Feb 29 '20 at 21:31

regex VS Contains. Best Performance?

8 Answers8

Linked