45

Why does this pattern fail to compile:

Pattern.compile("(?x)[ ]\\b");

Error

ERROR java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 8
(?x)[ ]\b
        ^
at java_util_regex_Pattern$compile.call (Unknown Source)

While the following equivalent ones work?

Pattern.compile("(?x)\\ \\b");
Pattern.compile("[ ]\\b");
Pattern.compile(" \\b");

Is this a bug in the Java regex compiler, or am I missing something? I like to use [ ] in verbose regex instead of backslash-backslash-space because it saves some visual noise. But apparently they are not the same!

PS: this issue is not about backslashes. It's about escaping spaces in a verbose regex using a character class containing a single space [ ] instead of using a backslash.

Somehow the combination of verbose regex (?x) and a character class containing a single space [ ] throws the compiler off and makes it not recognize the word boundary escape \b


Tested with Java up to 1.8.0_151

200_success
Tobia
  • 6
    Not that it would solve the question, but how is a character class, containing just a space, different from a literal space? – user unknown Mar 13 '18 at 19:45
  • 4
    @userunknown: The `x` flag (enabled by the OP's `(?x)`) causes whitespace and comments to be ignored; so `(?x)a b` is equivalent to `ab`, whereas `(?x)a\ b` is equivalent to `a b`. As Socowi explains in his/her answer, the problem is that the OP expected `(?x)a[ ]b` to be equivalent to `a[ ]b` (i.e. to `a b`), when in fact it's equivalent to `a[]b` (which is invalid). – ruakh Mar 14 '18 at 06:12
  • 2
    @ruakh Exactly. In all other PCRE engines `[ ]` is a valid way to escape spaces in verbose regex, see for example Perl: `echo 'a b' | perl -lne 'print if /a[ ]b/x'` or libpcre: `echo 'a b' | pcregrep '(?x)a[ ]b'` – Tobia Mar 14 '18 at 08:53

5 Answers

31

I like to use [ ] in verbose regex instead of backslash-backslash-space because it saves some visual noise. But apparently they are not the same!

"[ ]" is the same as "\\ " or even " ".

The problem is the (?x) at the beginning enabling comments mode. As the documentation states

Permits whitespace and comments in pattern.
In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.
Comments mode can also be enabled via the embedded flag expression (?x).

In comments mode the regex "(?x)[ ]\\b" is the same as "[]\\b". It won't compile because Java does not parse [] as an empty character class: the ] directly after [ is taken as a literal, as in "[\\]", so the class is never closed.

Use " \\b" instead. Alternatively, preserve the space in comments mode by escaping it with a backslash: "(?x)[\\ ]\\b" or "(?x)\\ \\b".
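The variants above can be checked with a small sketch. The class and helper names (VerboseSpaceDemo, compiles, finds) are made up for illustration; only Pattern and PatternSyntaxException come from the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class VerboseSpaceDemo {
    /** Returns true if the regex compiles, false if Pattern.compile rejects it. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    /** Returns true if the regex matches somewhere in the input. */
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Unescaped space inside the class is dropped in comments mode.
        System.out.println(compiles("(?x)[ ]\\b"));
        // Escaped spaces survive comments mode, inside or outside a class.
        System.out.println(finds("(?x)[\\ ]\\b", "a b"));
        System.out.println(finds("(?x)\\ \\b", "a b"));
        // Without (?x) the plain forms behave as expected.
        System.out.println(finds("[ ]\\b", "a b"));
        System.out.println(finds(" \\b", "a b"));
    }
}
```

In each matching case the space in "a b" is followed by a word character, so the \b after the space succeeds.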

Socowi
22

This is a bug in Java's peekPastWhitespace() method in the Pattern class. Tracing this entire issue down... I decided to take a look at OpenJDK 8-b132's Pattern implementation. Let's start hammering this down from the top:

  1. compile() calls expr() on line 1696
  2. expr() calls sequence() on line 1996
  3. sequence() calls clazz() on line 2063 since the case of [ was met
  4. clazz() calls peek() on line 2509
  5. peek() calls peekPastWhitespace() on line 1830 since if(has(COMMENTS)) evaluates to true (due to having added the x flag (?x) at the beginning of the pattern)
  6. peekPastWhitespace() (posted below) skips all spaces in the pattern.

peekPastWhitespace()

private int peekPastWhitespace(int ch) {
    while (ASCII.isSpace(ch) || ch == '#') {
        while (ASCII.isSpace(ch))
            ch = temp[++cursor];
        if (ch == '#') {
            ch = peekPastLine();
        }
    }
    return ch;
}

The same bug exists in the parsePastWhitespace() method.

Your regex is being interpreted as []\\b, which is the cause of your error because \b is not supported in a character class in Java. Moreover, once you fix the \b issue, your character class also doesn't have a closing ].
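To make the effect visible, here is a deliberately simplified model of the whitespace skipping. It is not the JDK code: the class CommentsModeModel and the method stripWhitespace are hypothetical, it ignores # comments and \Q...\E, and it uses Character.isWhitespace as an approximation of the parser's ASCII whitespace test. The one point it illustrates is that nothing in the skipping logic knows whether it is inside [...].

```java
class CommentsModeModel {
    /**
     * Simplified model of comments mode: drops every unescaped whitespace
     * character, whether or not it sits inside a character class.
     * A backslash keeps the character that follows it, even a space.
     */
    static String stripWhitespace(String regex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Escaped character: copy the backslash and the next char as-is.
                out.append(c).append(regex.charAt(++i));
            } else if (!Character.isWhitespace(c)) {
                out.append(c);
            }
            // Unescaped whitespace is silently skipped.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripWhitespace("[ ]\\b"));   // prints []\b
        System.out.println(stripWhitespace("[\\ ]\\b")); // prints [\ ]\b
    }
}
```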

What you can do to fix this problem:

  1. "\\ ": as the OP mentioned, simply use a double backslash followed by a space
  2. "[\\ ]": escape the space within the character class so that it gets interpreted literally
  3. "[ ](?x)\\b": place the inline modifier after the character class
ctwheels
  • It does seem that PHP and Python parse it differently with the [ ] being considered a literal space despite the extended mode, according to regex101.com. I suppose it's fair to call this a bug, based upon that. Are there any other references that we could use to definitively say it's a bug? – Corrodias Mar 13 '18 at 21:06
  • 2
    Perl also interprets [ ] as a literal space even in (?x) mode (and this is specifically mentioned in `perlre(1p)`: "a bracketed character class is unaffected by /x"), and Perl _invented_ (?x) mode, so I think that should be dispositive: it's a bug. – zwol Mar 14 '18 at 00:21
  • 3
    OP here. I've been writing extended/verbose regex in Perl, Python, PHP, libpcre, and other "PCRE" flavours for years. This is the first time I have seen whitespace being skipped over in a character class. If Java's regex are to be Perl- and PCRE-compatible then yes, this is a bug in the code. Otherwise it's a bug in the documentation, because it doesn't point out this deviation from the de-facto standard. – Tobia Mar 14 '18 at 09:00
  • How to match `#`? – Nils Lindemann Sep 03 '18 at 20:12
  • @Nils escaping it doesn’t work? Doesn’t look like you can use it otherwise, you’ll have to use the inline modifier – ctwheels Sep 03 '18 at 20:20
  • @ctwheels Yes, you are right. I was testing this using a Scala online compiler (Scala uses Java under the hood), but it wouldn't work there. Now testing it locally, both work: `(?x)\#` and `(?x)(?-x:#)` – Nothing is as stable and usable as the command line! – Nils Lindemann Sep 03 '18 at 20:55
12

It looks like, because of free-spacing (verbose) mode (?x), the space in [ ] is ignored, so the regex engine sees your regex as []\\b.
If we removed \\b, it would be seen as [] and we would get an "Unclosed character class" error: a character class can't be empty, so a ] placed directly after [ is treated as the first character belonging to that class rather than as the metacharacter that closes it.

So since [ is unclosed, the regex engine sees \b as being placed inside that character class. But \b can't be placed there (it represents a position, not a character), so we see the error about an "unsupported escape sequence" (inside a character class, although the message omits that part).

In other words you can't use [ ] to escape space in verbose mode (at least in Java). You would need to either use "\\ " or "[\\ ]".
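The two failure modes described above can be observed with a short sketch. The names VerboseErrorDemo and describe are invented for this example, and the exact wording of the messages may vary between Java versions, so no particular text is assumed here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class VerboseErrorDemo {
    /** Returns "OK" if the regex compiles, otherwise the error description. */
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return "OK";
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Seen as "[]": the class is never closed.
        System.out.println(describe("(?x)[ ]"));
        // Seen as "[]\b": \b ends up inside the unclosed class.
        System.out.println(describe("(?x)[ ]\\b"));
        // The two escaped forms compile.
        System.out.println(describe("(?x)\\ \\b"));
        System.out.println(describe("(?x)[\\ ]\\b"));
    }
}
```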

Pshemo
5

A workaround

Besides escaping each whitespace character separately (which is literally the same as [ ]), you could keep x mode on for the entire regex but disable it inline for the parts of the pattern that need whitespace:

(?x)match-this-(?-x: with spaces )\\b
    ^^^^^^^^^^^     ^^^^^^^^^^^^^ ^^^
    `x` is on            off       on

Alternatively, use the quoting metacharacters \Q...\E:

(?x)match-this-\Q with s p a c e s \E\\b
    ^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^  ^^^
    `x` is on            off          on
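Both workarounds can be tried with a sketch like the following. The class InlineFlagDemo, the helper matches, and the foo/bar/baz patterns are placeholders chosen for this example, not taken from the question.

```java
import java.util.regex.Pattern;

class InlineFlagDemo {
    /** Returns true if the whole input matches the regex. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // x mode is on for "foo" and "baz" but switched off inside (?-x: ... ),
        // so the spaces around "bar" are literal.
        String inline = "(?x) foo (?-x: bar ) baz";
        // \Q...\E quotes its contents, so the spaces around "bar" are literal too.
        String quoted = "(?x) foo \\Q bar \\E baz";

        System.out.println(matches(inline, "foo bar baz"));
        System.out.println(matches(quoted, "foo bar baz"));
        // Without the literal spaces in the input, neither pattern matches.
        System.out.println(matches(inline, "foobarbaz"));
        System.out.println(matches(quoted, "foobarbaz"));
    }
}
```

The inline form keeps the rest of the pattern free to use whitespace and comments, while \Q...\E is handy when the literal text also contains other metacharacters.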

Why an Exception?

In extended or comments mode (x), whitespace is ignored, but different flavors handle spaces within character classes differently.

For example, in PCRE all whitespace characters are ignored except those in a character class. That means [ ] is a valid regex there, but Java makes no such exception:

In this mode, whitespace is ignored...

Period. So [ ] is equal to [], which is not valid and throws a PatternSyntaxException.

Almost all regex flavors except JavaScript need a character class to contain at least one data unit. They treat an empty character class as an unclosed set that still needs a closing bracket. That said, []] is valid in most flavors.

Free-spacing mode in different flavors on [ ]:

  • PCRE: valid
  • .NET: valid
  • Perl: valid
  • Ruby: valid
  • TCL: valid
  • Java 7: invalid
  • Java 8: invalid
revo
5

Let's analyse what happens exactly.

Take a look at the source code of java.util.regex.Pattern. The documentation of the COMMENTS flag says:

Permits whitespace and comments in pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line.

Comments mode can also be enabled via the embedded flag expression (?x).

Your regex leads to this method:

private void accept(int ch, String s) {
    int testChar = temp[cursor++];
    if (has(COMMENTS))
        testChar = parsePastWhitespace(testChar);
    if (ch != testChar) {
        throw error(s);
    }
}

Notice that it calls parsePastWhitespace(testChar):

private int parsePastWhitespace(int ch) {
    while (ASCII.isSpace(ch) || ch == '#') {
        while (ASCII.isSpace(ch))//<----------------Here is the key of your error
            ch = temp[cursor++];
        if (ch == '#')
            ch = parsePastLine();
    }
    return ch;
}

In your case there is a white space in your regular expression (?x)[ ]\\b, so this returns something (I can't analyse exactly what) that fails this check:

    if (ch != testChar) {
        throw error(s);
    }

because it is not equal to ch, and here an exception is thrown:

throw error(s);
Bourbia Brahim
Youcef LAIDANI