12

My purpose is to match this kind of different urls:
url.com
my.url.com
my.extended.url.com
a.super.extended.url.com
and so on...

So, I decided to build the regex to have a letter or a number at start and end of the url, and to have a infinite number of "subdomains" with alphanumeric characters and a dot. For example, in "my.extended.url.com", "m" from "my" is the first class of the regex, "m" from "com" is the last class of the regex, and "y.", "extended." and "url." are the second class of the regex.

Using the pattern and subject in the code below, I want the find method to return me a false because this url must not match, but it uses 100% of CPU and seems to stay in an infinite loop.

    
    String subject = "www.association-belgo-palestinienne-be";
    Pattern pattern = Pattern.compile("^[A-Za-z0-9]\\.?([A-Za-z0-9_-]+\\.?)*[A-Za-z0-9]\\.[A-Za-z]{2,6}");

    Matcher m = pattern.matcher(subject);
    System.out.println("    Start");
    boolean hasFind = m.find();
    System.out.println("    Finish : " + hasFind);
  

Which only prints:

  
      Start
  

I can't reproduce the problem using regex testers.
Is it normal ? Is the problem coming from my regex ?
Could it be due to my Java version (1.6.0_22-b04 / JVM 64 bit 17.1-b03) ?

Thanks in advance for helping.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
carpediem
  • 135
  • 1
  • 2
  • 8

5 Answers5

19

The problem is the ([A-Za-z0-9_-]+\\.?)* part of the regular expression. Note that it has a quantifier (+) inside another quantifier (*). This causes catastrophic backtracking - basically, it has to try an exponential number of matches in order to check the regular expression, at least the way most regular expression engines are implemented (including the Java one).

If you use possessive quantifiers, you will be able to avoid this problem, however that would change the meaning of your regex, and it would no longer match what you want it to match.

I think the trick here is to find a regex which expresses what you want to solve, without double quantifiers. For example, the following should work:

Pattern.compile("^[A-Za-z0-9]\\.?([A-Za-z0-9_-]|[A-Za-z0-9_-]\\.)*[A-Za-z0-9]\\.[A-Za-z]{2,6}$");

I think this expresses the same class of strings that you are trying to match, and should be much faster.

Avi
  • 19,934
  • 4
  • 57
  • 70
  • Thank you. It works! But I take the time to fully understand the concept and I come back to you. – carpediem Dec 21 '10 at 15:22
  • Actually, it seems I was wrong about the fix :-(, though not about the problem. It was returning false even for things that should match, such as "www.association-belgo-palestinienne-be.com". I've corrected the solution - see if it works now for you. – Avi Dec 21 '10 at 15:30
  • On looking over the other answers, I think @marcog's solution should also work, and is cleaner than mine. – Avi Dec 21 '10 at 15:32
  • yes, I tried it and some cases are not supported but let me thank for introducing me to possessive quantifiers ! – carpediem Dec 21 '10 at 15:59
13

It's not an infinite loop. The problem is that it's checking every possible match and not finding one. If you could let it run for a gazillion years, it will eventually terminate. See this article for a good explanation of what's happening under the hood.

Perhaps this regular expression is satisfactory (it terminates on the given string): ^[A-Za-z0-9][A-Za-z0-9_-]*(\\.[A-Za-z0-9_-]+)*\\.[A-Za-z]{2,6}$ (see http://ideone.com/Z0rlg)

moinudin
  • 134,091
  • 45
  • 190
  • 216
  • This is not my problem. I know the url is bad and this is precisely why I want the method to return a false response. The problem is that the method seems to keep running and up the CPU. – carpediem Dec 21 '10 at 15:02
  • @carpediem What makes the url bad? – moinudin Dec 21 '10 at 15:06
  • @marcog because it should have an extension. My regex works perfectly with all examples I showed in my question and are valid urls. The regex works also with bad urls, for those it returns false. This case seems to be very particular and difficult to debug because online testers works correctly. – carpediem Dec 21 '10 at 15:11
  • @carpedium Why isn't `association-belgo-palestinienne-be` the "extension" (actually called tld)? If you want to constrain the tld then you need to define them in your regex. Check http://regexlib.com/Search.aspx?k=URL for some sane examples. – moinudin Dec 21 '10 at 15:14
  • @marcog I really don't understand why you talk me about the tld. The regex is to match it due to the "\\.[A-Za-z]{2,6}" part and previous elements. A new time, my problem is only about the fact the method never stops. – carpediem Dec 21 '10 at 15:19
  • @carpediem I answered that question directly in my second paragraph. Now I'm trying to help you find a regex that doesn't have an exponential aspect so that it does terminate in a reasonable time, but I need more details about the strings you want to match. – moinudin Dec 21 '10 at 15:22
  • @carpediem See the regex I just added to my answer. If that doesn't work for you, tell me where it fails. – moinudin Dec 21 '10 at 15:25
  • @marcog Your regex will only match with an url like my.u.t.com if I understand well. But modifying it as ^[A-Za-z0-9][A-Za-z0-9_-]*(\\.[A-Za-z0-9_-]+)*[A-Za-z0-9]\\.[A-Za-z]{2,6}$ can cover more cases. I will come back to confirm this. – carpediem Dec 21 '10 at 15:57
  • @marcog it doesn't matter :-) My regex is a little bit more restrictive because the domain of the url must finish with an alphanumeric character. I would like to thank both Avi and you, and accept your answer because yours helped me more to find the solution. – carpediem Dec 21 '10 at 16:45
5

It isn't really an infinite loop, it's just taking a really long time. For all practical purposes, we can call it a hang.

Your Regex may be improved.

Try to put $ at the end of it. It will say that this is the end of the line. It may help you saving time.

Edit :

 String subject = "www-association-belgo-palestinienne-be";
 Pattern pattern = Pattern.compile("^[A-Za-z0-9]([-_A-Za-z0-9]*)(\\.[-_A-Za-z0-9]+)*\\.([-_A-Za-z0-9]+\\.)*([-_A-Za-z0-9]*)[A-Za-z0-9]$");

 Matcher m = pattern.matcher(subject);
 System.out.println("    Start");
 boolean hasFind = m.find();
 System.out.println("    Finish : " + hasFind);
LaGrandMere
  • 10,265
  • 1
  • 33
  • 41
  • I don't think so. Even if I wait for a long time (around 20 minutes), the find is never released. – carpediem Dec 21 '10 at 14:57
  • @carpediem Testing it right now, I 'll edit if I find something :) – LaGrandMere Dec 21 '10 at 15:03
  • @carpediem: Why should that be impossible? It is like brute-forcing. – jwueller Dec 21 '10 at 15:06
  • @LaGrandMere I can add that I reproduced the problem on NetBeans 6.7.1 and JDK 1.6 – carpediem Dec 21 '10 at 15:15
  • @marcog: Wow, nice. Very insightful. – jwueller Dec 21 '10 at 15:16
  • I can assure is not an infinite loop. I'm debugging it with The Regex Coach and it take exponentially more time for any character in the target string I add. – Ither Dec 21 '10 at 15:18
  • @carpediem : with my regex, www.association-belgo-palestinienne-be returns true. Because there is no more {2-6} at the end of it. You didn't talk about this in your post, so I went without it, if you need it, I can edit my Regex ;) – LaGrandMere Dec 21 '10 at 15:21
  • @LaGrandMere thank you but actually, your regex doesn't match my purpose because I need to support the extension match as in my question regex. – carpediem Dec 21 '10 at 15:25
  • "Even if I wait for a long time (around 20 minutes)" 20 minutes is **not even close** to a long time in this context. More like 20 centuries. We're dealing with a very large (but still quite finite) iteration count here. – Jon Molnar May 07 '15 at 18:34
1

See How do you debug a regex?.

Specifically, I would try regexpal, and change the java backslashes to single ones.

Community
  • 1
  • 1
Yuval F
  • 20,565
  • 5
  • 44
  • 69
  • I tried with this one http://www.regexplanet.com/simple/ which uses the Java Pattern class (can't try the one you suggested me which down both chrome and ie8 !!). The result is the one I expected as I said in my question. So, I have some doubts about my Java version. – carpediem Dec 21 '10 at 15:08
  • The double backslashes are required in Java, as `\\.` is the only way to match a literal period. – Avi Dec 21 '10 at 15:47
-3

It is an obvious Java regexp implementation bug. Look at the results with Your regexp and input data here

and You will see how quickly this is evaluated

foo
  • 574
  • 8
  • 13