2

I don't know why but when I try to run this program, it's looks like the program will run forever.

package fjr.test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test3 {

    public static void main(String[] args){

        String regex = "dssdfsdfdsf wdasdads dadlkn mdsds .";

        Pattern p  = Pattern.compile("^([a-zA-Z]+ *)+$"); 

        Matcher match =  p.matcher(regex); 

        if(match.matches()){
            System.out.println("Yess"); 
        }else{
            System.out.println("No.. "); 
        }

        System.out.println("FINISH..."); 
    }

}

What I need to do was to match the pattern that contain a bunch of word that only separated by spaces

Mohammad Fajar
  • 957
  • 1
  • 12
  • 24

4 Answers4

8

Your program has likely encountered what's called catastrophic backtracking.

If you have a bit of time, let's look at how your regex works...

Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.

On the left hand side, we have our pattern:

/^([a-zA-Z]+ *)+$/

And here's the String to match:

dssdfsdfdsf wdasdads dadlkn mdsds .

From the regex101 debugger, your regex took 78540 steps to fail. This is because you used quantifiers that are greedy and not possessive (backtracking).

x

... Long story short, because the input string fails to match, every quantifier within your regex causes indefinite backtracking - Every character is released from + and then * and then both and then a group is released from ()+ to backtrack more.

Here's a few solutions you should follow:

Avoid abundant quantifiers!

If you revise your expression, you'll see that the pattern is logically same as:

/^[a-zA-Z]+( +[a-zA-Z]+)*$/

This uses a step of logical induction to reduce the regex upstairs to match far quicker, now at 97 steps!

y

Use possessive quantifiers while you can!

As I mentioned, /^([a-zA-Z]+ *)+$/ is evil because it backtracks in a terrible manner. We're in Java, what can we do?

This solution works only because [a-zA-Z] and matches distinct items. We can use a possessive group!

/^([a-zA-Z]++ *+)++$/
            ^  ^  ^

These simple "+" denotes "We're not backtracking if we fail the match from here". This is an extremely effective solution, and cuts off any need for backtracking. Whenever you have two distinct groups with a quantifier in between, use them. And if you need some proof on the effectiveness, here's our scorecard:

z

Read also:

Online Demos:

Community
  • 1
  • 1
Unihedron
  • 10,902
  • 13
  • 62
  • 72
2

This does terminate, but it does take 10 seconds or so. Some observations:

  • Removing the fullstop from the end of the test string makes it fast.
  • Changing the * in the regex to a + (which i believe is actually what you want) makes it fast. I think having the option of 0 characters in that spot expands the state space a lot. I would use:

    ^(\w+ +)*\w+$"

Which means a bunch of (word space), followed by a word. Ran against your example and it is fast

triggerNZ
  • 4,221
  • 3
  • 28
  • 34
1

You can use this regex to match all word with spaces or none.

sample:

 Pattern p  = Pattern.compile("([a-zA-Z ]+)"); 
Rod_Algonquin
  • 26,074
  • 6
  • 52
  • 63
0

Interesting phenomena. This is related on Greedy quantifiers's behaviour & performance.
Based on your pattern ^([a-zA-Z]+ *)+$ , and your inpput "dssdfsdfdsf wdasdads dadlkn mdsds ." , the patther doesn't match your input, however, the java regex will backtrack all the possibilities of ^([a-zA-Z]+ *)+, (see below examples) and then obtain the not-match results.

(dssdfsdfdsf )(wdasdads )(dadlkn )(mdsds )
(dssdfsdfdsf )(wdasdads )(dadlkn )(mdsd)(s )
(dssdfsdfdsf )(wdasdads )(dadlkn )(mds)(ds )
...
(d)(s)(s)(d)(f)(s)(d)(f)(d)(s)(f )(w)(d)(a)(s)(d)(a)(d)(s )(d)(a)(d)(l)(k)(n )(m)(d)(s)(d)(s )
...
(dssdfsdfdsf )
...
(d)

The backtrack could be more than 200,000,000 times.
I'm a little bit curious why java-regex can't do some performance improvement, since after first try, it found the epxected char is '.', not '$', then any further backtrack won't be successful. it's useless to do the backtrack.

Therefore, when we define the Loop pattern, we need to pay more attenttion on the internal Loop pattern (e.g.: [a-zA-Z]+ * in your example), not making it matching so many case. Otherwise, the backtrack numbers for the whole Loop will be huge.
Another similar exmaple (more bad then your case):
Pattern: "(.+)*A"
Input: "abcdefghijk lmnopqrs tuvwxy zzzzz A" -- So quick
Input: "abcdefghijAk lmnopqrs tuvwxy zzzzz " -- Not bad, just wait for a while
Input: "aAbcdefghijk lmnopqrs tuvwxy zzzzz " -- it seems infinit. (acutally, not, but I have no patience to wait its finishing.)

jawee
  • 271
  • 2
  • 11