
I'm learning Regex, and running into trouble in the implementation.

I found the RegexTestHarness on the Java Tutorials, and running it, the following string correctly identifies my pattern:

[\d|\s][\d]\.

(My pattern is any double digit, or any single digit preceded by a space, followed by a period.)

That string is obtained by this line in the code:

Pattern pattern = 
        Pattern.compile(console.readLine("%nEnter your regex: "));

When I try to write a simple class in Eclipse, it tells me the escape sequences are invalid, and won't compile unless I change the string to:

[\\d|\\s][\\d]\\.

In my class I'm using Pattern pattern = Pattern.compile(); with that string. When I put this string back into the TestHarness, it doesn't find the correct matches.

Can someone tell me which one is correct? Is the difference in some formatting from console.readLine()?

nexus_2006
  • You need to understand how string literals and _Java_ string escapes work. – SLaks Dec 18 '13 at 21:38
  • Also, your regex doesn't enforce the space before double-digit numbers. Use anubhava's regex. – Gus Dec 18 '13 at 21:46
  • I'm only looking for space before single digit numbers, or double digit numbers (don't care what preceded by). – nexus_2006 Dec 18 '13 at 21:48

4 Answers


\ is a special character in string literals "...". It is used to escape other special characters, or to create characters like \n, \r, and \t.
To put a \ character into a string literal so that it reaches the regex engine, you need to escape it by adding another \ before it (just as you do in a regex when you need to escape a metacharacter, such as the dot: \.). So a String containing a single \ is written as "\\".

This problem doesn't exist when you read data from the user, because that text is not a string literal and the compiler never processes it. If the user types \n in the console, it is read as two characters: \ and n.
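To make the difference concrete, here is a minimal sketch. The class name EscapeDemo and its two helper methods are made up for illustration. It checks that the literal "\\d" holds only two characters, the same two a user would type at the console.

```java
public class EscapeDemo {
    // The source literal "\\d" compiles to two characters: a backslash and 'd'.
    static String fromLiteral() {
        return "\\d";
    }

    // Stands in for text typed at the console: the backslash is built from its
    // character code (92), so no escape sequence is involved at all.
    static String asTyped() {
        return new String(new char[] {(char) 92, 'd'});
    }

    public static void main(String[] args) {
        System.out.println(fromLiteral().length());          // 2
        System.out.println(fromLiteral().equals(asTyped())); // true
        System.out.println("a\\nb".length()); // 4: a, \, n, b
        System.out.println("a\nb".length());  // 3: a, newline, b
    }
}
```

So the regex engine sees exactly the same pattern in both cases; only the way you write it down in source code differs.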


Also, there is no point in adding | inside a character class [...] unless you want the class to match the | character as well. Remember that [abc] is the same as (a|b|c), so there is no need for | in "[\\d|\\s]".
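A quick way to see this for yourself is the sketch below. The class and method names are hypothetical; only Pattern and Matcher come from java.util.regex. It compares the class with and without the | against a few single-character inputs.

```java
import java.util.regex.Pattern;

public class PipeInClassDemo {
    // Character class from the question: digit, '|' or whitespace.
    static final Pattern WITH_PIPE = Pattern.compile("[\\d|\\s]");
    // Same class without the stray '|': digit or whitespace only.
    static final Pattern WITHOUT_PIPE = Pattern.compile("[\\d\\s]");

    static boolean matchesWithPipe(String s) {
        return WITH_PIPE.matcher(s).matches();
    }

    static boolean matchesWithoutPipe(String s) {
        return WITHOUT_PIPE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWithPipe("|"));    // true: '|' is a literal member of the class
        System.out.println(matchesWithoutPipe("|")); // false
        System.out.println(matchesWithoutPipe("7")); // true
        System.out.println(matchesWithoutPipe(" ")); // true
    }
}
```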

Pshemo
  • Thank you. This explains why the pattern is different when entered via the sample program/command line, vs at compile time. – nexus_2006 Dec 18 '13 at 22:00

(My pattern is any double digit or single digit preceded by a space, followed by a period.)

The correct regex would be:

Pattern pattern = Pattern.compile("(\\s\\d|\\d{2})\\.");
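As a rough check of that pattern, here is a small sketch. The class name, the findAll helper and the sample sentence are all invented for this example. It collects every match in a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberDotDemo {
    // Either whitespace plus one digit, or two digits, followed by a literal dot.
    static final Pattern PATTERN = Pattern.compile("(\\s\\d|\\d{2})\\.");

    // Returns every substring of the input that the pattern matches.
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Finds " 7." and "42."; "x3." is skipped because the 3 is
        // neither preceded by whitespace nor part of a two-digit number.
        System.out.println(findAll("Item 7. then 42. but x3. is skipped"));
    }
}
```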

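Also, if you're getting a string from user input and want it matched as literal text rather than interpreted as a regex, then you should call:

Pattern.quote(useInputRegex);

This escapes all the regex special characters, so the whole input is treated as a literal.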
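The effect of Pattern.quote can be sketched as follows. QuoteDemo and its two helpers are hypothetical names; the point is only to contrast compiling the input as a regex with compiling it as a quoted literal.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Searches for the user's input as literal text.
    static boolean containsLiteral(String haystack, String userInput) {
        return Pattern.compile(Pattern.quote(userInput)).matcher(haystack).find();
    }

    // Searches using the user's input as a regex.
    static boolean containsRegex(String haystack, String userInput) {
        return Pattern.compile(userInput).matcher(haystack).find();
    }

    public static void main(String[] args) {
        System.out.println(containsRegex("a1b", "\\d"));     // true: the regex \d finds the digit
        System.out.println(containsLiteral("a1b", "\\d"));   // false: looks for the characters \ and d
        System.out.println(containsLiteral("a\\db", "\\d")); // true: the text contains \ followed by d
    }
}
```

If the user is supposed to type an actual regex, as in the TestHarness, quoting would defeat the purpose, so pass the input to Pattern.compile unchanged in that case.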

Also, you need the double escaping because the first escape is consumed by the Java compiler when it processes the string literal, and the second one is passed on to the regex engine.

Vadim Kotov
anubhava

If you want to represent a backslash in a Java string literal you need to escape it with another backslash, so the string literal "\\s" is two characters, \ and s. This means that to represent the regular expression [\d\s][\d]\. in a Java string literal you would use "[\\d\\s][\\d]\\.".

Note that I also made a slight modification to your regular expression: [\d|\s] will match a digit, whitespace, or the literal | character. You just want [\d\s]. A character class already means "match one of these"; since you don't need | for alternation within a character class, it loses its special meaning there.

Andrew Clark

What is happening is that escape sequences are being evaluated twice: once by Java, and then once by the regex engine.

The result is that you need to escape the escape character when you use a regex escape sequence.

For instance, if you needed a digit, you'd use

"\\d"