So I'm learning regex in Java and was wondering why, when I execute this code:

String xxx = "(\\s+)?(c:/|c:\\\\|C:\\\\|C:/|c:\\|C:\\))?(\\w+(/|\\\\)?)+(/|\\\\)\\w+.[a-z]+";

String x = "C:\\Users\\esteban\\Desktop\\Java_file_testing\\file3.txt";

if (x.matches(xxx)) {
    System.out.println("matches");
} else {
    System.out.println("no match found ");
}

This prints "matches", but when I remove the .txt it just keeps processing without any response. Am I doing something wrong?

shep

2 Answers

You stumbled upon a case of catastrophic backtracking!

When you write (\\w+(/|\\\\)?)+, you are basically introducing the (\\w+)+ pattern into your regex, because the separator after \\w+ is optional. This gives the regex engine the opportunity to match the same string in multiple ways (using either the inner or the outer +). The number of possible paths increases exponentially with the length of the input, and since the engine has to try every one of them before declaring failure, it takes forever to return a value.
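To see the effect without hanging the JVM, here is a minimal sketch that keeps the same group shape but only feeds it short inputs. The class name, the helper names and the trailing `!` (a character the input never contains, so every attempt has to fail) are made up for illustration. Timings depend on the machine and the JDK version, so none are quoted here.

```java
import java.util.regex.Pattern;

class BacktrackingDemo {
    // Same shape as the group in the question, followed by a character ('!')
    // that the test input never contains, so every attempt ends in failure.
    static final Pattern NESTED = Pattern.compile("(\\w+(/|\\\\)?)+!");

    // Builds a string of n word characters, e.g. word(3) is "aaa".
    static String word(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    // Returns the match result and prints how long the attempt took.
    static boolean timedMatch(int n) {
        long start = System.nanoTime();
        boolean result = NESTED.matcher(word(n)).matches();
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("n=" + n + " matches=" + result + " time=" + micros + "us");
        return result;
    }

    public static void main(String[] args) {
        // Lengths are kept small on purpose so the program always finishes.
        for (int n = 4; n <= 20; n += 4) {
            timedMatch(n);
        }
    }
}
```

Every call returns false. What matters is how the time column changes as n grows, not the absolute numbers.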

Also, a few general comments on your regex:

  • c:\\| will match, literally, the string c:|
  • /|\\\\ is just [/\\\\]
  • (\\s+)? is just \\s*
  • . is a wildcard ("anything but a newline") that needs to be escaped
  • for the c/C variations, either use [cC] or make your whole regex case insensitive
  • when you don't need to actually capture values, using non-capturing groups (?:...) relieves the engine of some work
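Most of these remarks can be checked directly. The following is a small sketch; the class name, the `full` helper and the sample strings such as `file3Xtxt` are made up for illustration.

```java
import java.util.regex.Pattern;

class RegexNotesDemo {
    // Thin wrapper: true only if the whole input matches the regex.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The Java literal "c:\\|" is the regex c:\| (an escaped pipe),
        // so it matches the three characters c:| and not c:\
        System.out.println(full("c:\\|", "c:|"));   // true
        System.out.println(full("c:\\|", "c:\\"));  // false

        // A character class covers both separators.
        System.out.println(full("[/\\\\]", "/"));   // true
        System.out.println(full("[/\\\\]", "\\"));  // true

        // An unescaped dot accepts any character in that position.
        System.out.println(full("\\w+.[a-z]+", "file3Xtxt"));    // true
        System.out.println(full("\\w+\\.[a-z]+", "file3Xtxt"));  // false
        System.out.println(full("\\w+\\.[a-z]+", "file3.txt"));  // true

        // Two ways to accept both c and C.
        System.out.println(full("[cC]:/", "C:/"));   // true
        System.out.println(full("(?i)c:/", "C:/"));  // true
    }
}
```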

Taking these into account, a regex in the spirit of your first attempt could be:

\\s*(?:[cC]:[/\\\\])?(?:\\w+[/\\\\])*\\w+\\.[a-z]+

In (?:\\w+[/\\\\]), the character class [/\\\\] isn't optional any more, thus avoiding the (\\w+)+ pattern.
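As a quick check, here is a sketch that runs this regex against the path from the question, with and without the extension. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class PathRegexDemo {
    // The rewritten regex, as a Java string literal.
    static final Pattern PATH =
            Pattern.compile("\\s*(?:[cC]:[/\\\\])?(?:\\w+[/\\\\])*\\w+\\.[a-z]+");

    // True only if the whole string looks like a path ending in ".ext".
    static boolean isPath(String s) {
        return PATH.matcher(s).matches();
    }

    public static void main(String[] args) {
        String withExt = "C:\\Users\\esteban\\Desktop\\Java_file_testing\\file3.txt";
        String withoutExt = "C:\\Users\\esteban\\Desktop\\Java_file_testing\\file3";

        System.out.println(isPath(withExt));     // true
        System.out.println(isPath(withoutExt));  // false, and it returns right away
    }
}
```

Because every repetition of the group must end in a separator, there is only one way to split the path into segments, so the failing case has very little to backtrack over.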

For more information on catastrophic backtracking, I'd recommend the excellent (and fun!) article by Friedl on the subject in The Perl Journal.

Robin

Your regex uses an unescaped dot (.), which matches any character except a line terminator, not just a literal period.

You have to escape the dot like this:

(\\s+)?(c:/|c:\\\\|C:\\\\|C:/|c:\\|C:\\))?(\\w+(/|\\\\)?)+(/|\\\\)\\w+\\.[a-z]+
                                                          here --------^

Btw, you can shorten your regex like this:

\s*[Cc]:(?:(?:\/|\\{1,2})\w+)+\.\w+


Remember to escape the backslashes when you put it in a Java string literal:

\\s*[Cc]:(?:(?:\\/|\\\\{1,2})\\w+)+\\.\\w+
Federico Piazza
  • Hello. 1. I escaped the dot but it still just sits there processing. 2. When I use this regex `\\s*c:(?:[\\/\\]{1,2}\\w+)+\\.\\w+` it throws an exception: `Unclosed character class near index 27 \s*c:(?:[\/\]{1,2}\w+)+\.\w+` – shep Aug 24 '14 at 03:51
  • @user3710334 Have you tried with the second regex I posted? – Federico Piazza Aug 24 '14 at 03:53
  • @user3710334 You have to escape the regex like this: `\\s*[Cc]:(?:(?:\\/|\\\\{1,2})\\w+)+\\.\\w+` – Federico Piazza Aug 24 '14 at 03:55
  • OK, this one doesn't throw an exception but it never matches: `\\s*c:(?:[\\/\\\\]{1,2}\\w+)+\\.\\w+`. This one throws the exception: `\\s*c:(?:[\\/\\]{1,2}\\w+)+\\.\\w+` – shep Aug 24 '14 at 03:55
  • FYI `/` isn't a delimiter here, so there's no need to escape it. – Robin Aug 24 '14 at 04:08
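
The three regexes from these comments can be compared side by side. This sketch assumes the input is still the `C:\Users\...` string from the question; the class and method names are made up for illustration. In the first regex, `[\\/\\]` reaches the engine as `[\/\]`, where `\]` is an escaped bracket, so the class is never closed. The second one compiles but asks for a lowercase `c:`, which is a likely reason it never matches a path that starts with `C:`.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CommentRegexDemo {
    // The path from the question.
    static final String PATH = "C:\\Users\\esteban\\Desktop\\Java_file_testing\\file3.txt";

    // Reports whether the regex fails to compile, matches PATH, or does not match it.
    static String check(String regex) {
        try {
            return Pattern.matches(regex, PATH) ? "match" : "no match";
        } catch (PatternSyntaxException e) {
            return "syntax error";
        }
    }

    public static void main(String[] args) {
        // [\/\] : the closing bracket is escaped, so the class stays open.
        System.out.println(check("\\s*c:(?:[\\/\\]{1,2}\\w+)+\\.\\w+"));            // syntax error
        // Compiles, but only accepts a lowercase drive letter.
        System.out.println(check("\\s*c:(?:[\\/\\\\]{1,2}\\w+)+\\.\\w+"));          // no match
        // Accepts both c and C.
        System.out.println(check("\\s*[Cc]:(?:(?:\\/|\\\\{1,2})\\w+)+\\.\\w+"));    // match
    }
}
```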