129

What is the importance of Pattern.compile() method?
Why do I need to compile the regex string before getting the Matcher object?

For example :

String regex = "(\\S+)\\s*some\\s*";

Pattern pattern = Pattern.compile(regex); // why do I need to compile
Matcher matcher = pattern.matcher(text);
River
Sidharth
  • 2
    Well, the importance is almost NONE if the implementation (like in JDK 1.7) is just a mere SHORTCUT to new Pattern(regex, 0); That said, the REAL importance is not the static method itself, but the creation and return of a new Pattern that can be saved for later use. Maybe there are other implementations where the static method takes a new route and caches the Pattern objects, and that would be a real case of Pattern.compile() importance! – marcolopes Jul 06 '14 at 23:13
  • 1
    The answers highlight the importance of separating the pattern and matching classes (which is probably what the question asks), but nobody answers why we can't just use a constructor `new Pattern(regex)` instead of a static compile function. marcolopes' comment is spot on. – kon psych Mar 02 '17 at 21:02

7 Answers

158

The compile() method is always called at some point; it's the only way to create a Pattern object. So the question is really, why should you call it explicitly? One reason is that you need a reference to the Matcher object so you can use its methods, like group(int) to retrieve the contents of capturing groups. The only way to get ahold of the Matcher object is through the Pattern object's matcher() method, and the only way to get ahold of the Pattern object is through the compile() method. Then there's the find() method which, unlike matches(), is not duplicated in the String or Pattern classes.
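As a minimal sketch of that first reason, the snippet below keeps a reference to the Matcher so it can call find() and group(int). The class name GroupExample, the method name, and the sample pattern are illustrative assumptions, not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupExample {
    // Hypothetical pattern: a run of non-whitespace followed by the word "some"
    private static final Pattern WORD_BEFORE_SOME = Pattern.compile("(\\S+)\\s*some\\s*");

    // Collects capturing group 1 from every occurrence found in the text
    static List<String> wordsBeforeSome(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_BEFORE_SOME.matcher(text);
        // find() scans for the next occurrence; matches() would require
        // the whole input to match the pattern
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }
}
```

Neither find() nor group(int) is reachable through String#matches(String), which only returns a boolean.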

The other reason is to avoid creating the same Pattern object over and over. Every time you use one of the regex-powered methods in String (or the static matches() method in Pattern), it creates a new Pattern and a new Matcher. So this code snippet:

for (String s : myStringList) {
    if ( s.matches("\\d+") ) {
        doSomething();
    }
}

...is exactly equivalent to this:

for (String s : myStringList) {
    if ( Pattern.compile("\\d+").matcher(s).matches() ) {
        doSomething();
    }
}

Obviously, that's doing a lot of unnecessary work. In fact, compiling the regex and instantiating the Pattern object can easily take longer than performing the actual match. So it usually makes sense to pull that step out of the loop. You can create the Matcher ahead of time as well, though Matchers are not nearly as expensive:

Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher("");
for (String s : myStringList) {
    if ( m.reset(s).matches() ) {
        doSomething();
    }
}

If you're familiar with .NET regexes, you may be wondering if Java's compile() method is related to .NET's RegexOptions.Compiled modifier; the answer is no. Java's Pattern.compile() method is merely equivalent to .NET's Regex constructor. When you specify the Compiled option:

Regex r = new Regex(@"\d+", RegexOptions.Compiled); 

...it compiles the regex directly to CIL byte code, allowing it to perform much faster, but at a significant cost in up-front processing and memory use--think of it as steroids for regexes. Java has no equivalent; there's no difference between a Pattern that's created behind the scenes by String#matches(String) and one you create explicitly with Pattern#compile(String).

(EDIT: I originally said that all .NET Regex objects are cached, which is incorrect. Since .NET 2.0, automatic caching occurs only with static methods like Regex.Matches(), not when you call a Regex constructor directly.)

Alan Moore
  • 1
    Yet, this does not explain the importance of such a TRIVIAL method on the Pattern class! I always assumed that the static method Pattern.compile was much more than a simple SHORTCUT to new Pattern(regex, 0); I was expecting a CACHE of compiled patterns... i was wrong. Maybe creating a cache is more expensive than creating new patterns??! – marcolopes Jul 06 '14 at 23:07
  • 11
    Please note that Matcher class is not thread safe and should not be shared across threads. On the other hand Pattern.compile() is. – gswierczynski Aug 06 '14 at 01:01
  • 1
    TLDR; "... [Pattern.compile(...)] compiles the regex directly to CIL byte code, allowing it to perform much faster, but at a significant cost in up-front processing and memory use" – sean.boyer Sep 07 '16 at 18:36
  • 3
    While it's true that Matchers aren't nearly as expensive as Pattern.compile I did some metrics in a scenario where thousands of regex matches were happening and there was an additional, very significant saving by creating the Matcher ahead of time and reusing it via matcher.reset(). Avoiding the creation of new objects in the heap in methods called thousands of times is usually much lighter on CPU, memory and thus the GC. – Volksman Apr 14 '18 at 03:22
  • @Volksman that is not safe general advice because Matcher objects are not threadsafe. It’s also not relevant to the question. But yes, you could `reset` a Matcher object that is only ever used by one thread at a time in order to reduce allocations. – AndrewF Jul 03 '19 at 20:40
  • @AndrewF My scenario was for processing a large amount of data where order of processing was significant so it was all done in a single thread so no problems with thread safety there - I agree that it would be very unsafe to share Matcher instances across multiple threads. – Volksman Jul 04 '19 at 21:08
47

Compile parses the regular expression and builds an in-memory representation. The overhead of compiling is significant compared to a match. If you use a pattern repeatedly, caching the compiled pattern will improve performance.
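When the regex strings are only known at runtime, one possible way to cache compiled patterns is a small lookup table. This is only a sketch: the PatternCache class and its get method are hypothetical names, and the JDK itself does not provide such a cache.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

final class PatternCache {
    // Hypothetical cache keyed by the regex source string
    private static final ConcurrentMap<String, Pattern> CACHE = new ConcurrentHashMap<>();

    private PatternCache() {}

    // Compiles the regex on first use and returns the same Pattern afterwards
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }
}
```

The map here is unbounded, so this only makes sense for a limited set of distinct regexes. For a regex that is fixed at compile time, a static final Pattern field is simpler.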

Thomas Jung
  • 10
    Plus you can specify flags like case_insensitive, dot_all, etc. during compilation, by passing in an extra flags parameter – Sam Barnum Nov 12 '09 at 14:50
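The flags parameter mentioned in the comment above can be sketched like this, using the two-argument Pattern.compile(String, int) overload. The class name FlagExample and the sample regex are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

final class FlagExample {
    // CASE_INSENSITIVE ignores letter case; DOTALL lets '.' match line terminators
    static final Pattern HELLO_WORLD =
            Pattern.compile("hello.*world", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean matches(String input) {
        return HELLO_WORLD.matcher(input).matches();
    }
}
```

The same flags can also be embedded in the regex itself as (?i) and (?s), which is the only option when going through String#matches(String).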
17

When you compile the Pattern, Java does some up-front computation (it builds an in-memory representation of the regex) to make finding matches in Strings faster.

If you are going to reuse the Pattern multiple times you would see a vast performance increase over creating a new Pattern every time.

In the case of only using the Pattern once, the compiling step just seems like an extra line of code, but, in fact, it can be very helpful in the general case.

jjnguy
  • 5
    Of course you can write it all in one line `Matcher matched = Pattern.compile(regex).matcher(text);`. There are advantages to this over introducing a single method: the arguments are effectively named and it is obvious how to factor out the `Pattern` for better performance (or to split across methods). – Tom Hawtin - tackline Nov 12 '09 at 06:14
  • 1
    It always seems like you know so much about Java. They should hire you to work for them... – jjnguy Nov 12 '09 at 06:24
6

It is a matter of performance and memory usage: compile and keep the compiled pattern if you need to use it a lot. A typical use of regex is to validate user input (format) and to format output data for users. In such classes, saving the compiled pattern seems quite logical, as they are usually called a lot.

Below is a sample validator, which is really called a lot :)

import java.util.regex.Pattern;

public class AmountValidator {
    // Accepts e.g. 123, 123,456 and 123,345.34
    private static final String AMOUNT_REGEX = "\\d{1,3}(,\\d{3})*(\\.\\d{1,4})?|\\.\\d{1,4}";
    // Compile once and keep the pattern for reuse
    private static final Pattern AMOUNT_PATTERN = Pattern.compile(AMOUNT_REGEX);

    public boolean validate(String amount) {
        // matches() requires the whole input to match the pattern
        return AMOUNT_PATTERN.matcher(amount).matches();
    }
}

As mentioned by @Alan Moore, if you have a reusable regex in your code (used inside a loop, for example), you should compile it once beforehand and save the pattern for reuse.

Alireza Fattahi
4

Pattern.compile() allows you to reuse a compiled regex multiple times (the resulting Pattern is thread-safe). The performance benefit can be quite significant.

I did a quick benchmark:

    @Test
    public void recompile() {
        var before = Instant.now();
        for (int i = 0; i < 1_000_000; i++) {
            Pattern.compile("ab").matcher("abcde").matches();
        }
        System.out.println("recompile " + Duration.between(before, Instant.now()));
    }

    @Test
    public void compileOnce() {
        var pattern = Pattern.compile("ab");
        var before = Instant.now();
        for (int i = 0; i < 1_000_000; i++) {
            pattern.matcher("abcde").matches();
        }
        System.out.println("compile once " + Duration.between(before, Instant.now()));
    }

compileOnce was between 3x and 4x faster. I guess it depends heavily on the regex itself, but for a regex that is used often, I go for a static final Pattern pattern = Pattern.compile(...).
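The thread-safety point can be sketched as follows: one shared Pattern, with a fresh Matcher created per input, because Matcher instances are not thread-safe. The class name SharedPatternExample and the use of a parallel stream are assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

final class SharedPatternExample {
    // A Pattern is immutable, so a single instance can be shared across threads
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Counts the inputs consisting only of digits, possibly on several threads
    static long countNumeric(List<String> inputs) {
        return inputs.parallelStream()
                // matcher() creates a new Matcher per element, so no Matcher is shared
                .filter(s -> DIGITS.matcher(s).matches())
                .count();
    }
}
```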

apflieger
0

Similar to 'Pattern.compile', there is 'RECompiler.compile' [from com.sun.org.apache.regexp.internal], where:
1. the compiled code for the pattern [a-z] has 'az' in it
2. the compiled code for the pattern [0-9] has '09' in it
3. the compiled code for the pattern [abc] has 'aabbcc' in it.

Compiled code is thus a good way to generalize multiple cases. Instead of having different code handle situations 1, 2 and 3, the problem reduces to comparing a character's ASCII value with the current and next elements in the compiled code, hence the pairs. So:
a. anything with an ASCII value between 'a' and 'z' is between a and z
b. anything with an ASCII value between 'a' and 'a' is definitely 'a'

0

The Pattern class is the entry point of the regex engine. You can use it through Pattern.matches() and Pattern.compile(). The difference between the two: matches() is a quick check of whether a text (String) matches a given regular expression, while compile() creates a Pattern instance that you can keep a reference to, so you can use it multiple times to match the regular expression against multiple texts.

For reference:

public static void main(String[] args) {
    // Single use: Pattern.matches() compiles the regex on every call
    String text = "The Moon is far away from the Earth";
    String pattern = ".*is.*";
    boolean matches = Pattern.matches(pattern, text);
    System.out.println("Matches::" + matches);

    // Repeated use: compile once, then obtain Matchers from the Pattern
    Pattern p = Pattern.compile("ab");
    Matcher m = p.matcher("abaaaba");
    while (m.find()) {
        System.out.println(m.start() + " ");
    }
}
vkstream