28

Java is hanging with 100% CPU usage when I use the below string as input for a regular expression.

RegEx Used:

Here is the regular expression used for the description field in my application.

^([A-Za-z0-9\\-\\_\\.\\&\\,]+[\\s]*)+

String used for testing:

SaaS Service VLAN from Provider_One
2nd attempt with Didier SPT because the first one he gave me was wrong :-(

It works properly when I split the same string in different combinations. Like "SaaS Service VLAN from Provider_One", "first one he gave me was wrong :-(", etc. Java is hanging only for the above given string.

I also tried optimizing the regex as below.

^([\\w\\-\\.\\&\\,]+[\\s]*)+

Even with this is not working.

Kevin J. Chase
  • 3,856
  • 4
  • 21
  • 43
user1531484
  • 299
  • 4
  • 5

3 Answers3

60

Another classic case of catastrophic backtracking.

You have nested quantifiers that cause a gigantic number of permutations to be checked when the regex arrives at the : in your input string which is not part of your character class (assuming you're using the .matches() method).

Let's simplify the problem to this regex:

^([^:]+)+$

and this string:

1234:

The regex engine needs to check

1234    # no repetition of the capturing group
123 4   # first repetition of the group: 123; second repetition: 4
12 34   # etc.
12 3 4 
1 234
1 23 4
1 2 34
1 2 3 4

...and that's just for four characters. On your sample string, RegexBuddy aborts after 1 million attempts. Java will happily keep on chugging... before finally admitting that none of these combinations allows the following : to match.

How can you solve this?

You can forbid the regex from backtracking by using possessive quantifiers:

^([A-Za-z0-9_.&,-]++\\s*+)+

will allow the regex to fail faster. Incidentally, I removed all those unnecessary backslashes.

Edit:

A few measurements:

On the string "was wrong :-)", it takes RegexBuddy 862 steps to figure out a non-match.
For "me was wrong :-)", it's 1,742 steps.
For "gave me was wrong :-)", 14,014 steps.
For "he gave me was wrong :-)", 28,046 steps.
For "one he gave me was wrong :-)", 112,222 steps.
For "first one he gave me was wrong :-)", >1,000,000 steps.

Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • You need to keep the backslashes for the `.`. – Vala Jul 17 '12 at 13:25
  • 24
    @Thor84no: No. Inside a character class, a dot means a dot. – Tim Pietzcker Jul 17 '12 at 13:26
  • Thanks for the answer. Yes. I am using the .matches() method. The updated regex is working fine. Could you plase explain how the backslash will impact performance and what is the significance of ++ in the above regex?? Thanks – user1531484 Jul 18 '12 at 06:09
  • 2
    The needless backslashes are harmless to the code but hurtful to the eye. – tripleee Jul 18 '12 at 07:45
  • 1
    A possessive quantifier prohibits backtracking. "If you get this far and have a partial match, you can't go back. If you fail beyond this point, give up." – tripleee Jul 18 '12 at 07:47
  • 1
    @user1531484: tripleee already answered your question; I have added another link to the regex tutorial by regex guru Jan Goyvaerts (to the section concerning possessive quantifiers). I highly recommend the entire tutorial. – Tim Pietzcker Jul 18 '12 at 07:53
17

First, you need to realize that your regexes CANNOT match the supplied input string. The strings contain a number of characters ('<' '>' '/' ':' and ')') that are not "word" characters.

So why is it taking so long?

Basically "catastrophic backtracking". More specifically, the repeating structures of your regex give an exponential number of alternatives for the regex backtracking algorithm to try!

Here's what your regex says:

  1. One or more word characters
  2. Followed by zero or more space characters
  3. Repeat the previous 2 patterns as many times as you like.

The problem is with the "zero or more space characters" part. The first time, the matcher will match everything up to the first unexpected character (i.e. the '<'). Then it will back off a bit and try again with a different alternative ... that involves "zero spaces" before the last letter, then when that fails, it will move the "zero spaces" back one position.

The problem is that for String with N non-space characters, there as N different places that "zero spaces" can be matched, and that makes 2^N different combinations. That rapidly turns into a HUGE number as N grows, and the end result is hard to distinguish from an infinite loop.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
4

Why are you matching whitespace separately from the other characters? And why are you anchoring the match at the beginning, but not at the end? If you want to make sure the string doesn't start or end with whitespace, you should do something like this:

^[A-Za-z0-9_.&,-]+(?:\s+[A-Za-z0-9_.&,-]+)*$

Now there's only one "path" the regex engine can take through the string. If it runs out of characters that match [A-Za-z0-9_.&,-] before reaching the end, and the next character doesn't match \s, it fails immediately. If it reaches the end while still matching whitespace characters, it fails because it's required to match at least one non-whitespace character after each run of whitespace.

If you want to make sure there's exactly one whitespace character separating the runs of non-whitespace, just remove the quantifier from \s+:

^[A-Za-z0-9_.&,-]+(?:\s[A-Za-z0-9_.&,-]+)*$

If you don't care where the whitespace is in relation to the non-whitespace, just match them all with the same character class:

^[A-Za-z0-9_.&,\s-]+$

I'm assuming you know that your regex won't match the given input because of the : and ( in the smiley, and you just want to know why it takes so long to fail.

And of course, since you're creating the regex in the form of a Java string literal, you would write:

"^[A-Za-z0-9_.&,-]+(?:\\s+[A-Za-z0-9_.&,-]+)*$"

or

"^[A-Za-z0-9_.&,-]+(?:\\s[A-Za-z0-9_.&,-]+)*$"

or

"^[A-Za-z0-9_.&,\\s-]+$"

(I know you had double backslashes in the original question, but that was probably just to get them to display properly, since you weren't using SO's excellent code formatting feature.)

Alan Moore
  • 73,866
  • 12
  • 100
  • 156
  • "^[A-Za-z0-9_.&,-]+(?:\\s[A-Za-z0-9_.&,-]+)*$" does not match the same strings as the original, since it would not match a string with two consecutive spaces. But your regex "^[A-Za-z0-9_.&,-]+(?:\\s+[A-Za-z0-9_.&,-]+)*$" does and I like it much better than the possesive quantifier solution of Tim Pietzcker - the latter is too clever. :-) – Dr. Hans-Peter Störr Jul 21 '12 at 19:59