
I want to parse a String formatted as shown below with a regular expression in my method. However, even though online regex tools like RegExr show that my expression should match, it doesn't.

The expression I'm using is (@(\\d+))[(\r\n)\n](((0|1){"+width+"}[(\r\n)\n]){"+height+"}), where width and height are integer values for the required width and height of the text blocks.

The text blocks I want to retrieve from my file are formatted as follows:

@200
0000000000
0000011001
1100100000
0101001101
1110001110

@500
0000000000
0000011001
1100100000
0101001101
1110001110

etc.

(Here, width would be 10 and height 5)

I wanted to use the Matcher.find() method to retrieve each of those blocks, but the expression doesn't find anything.

I suspect there is a problem with the way I'm handling line breaks, but when I try to use the new Java 8 \R universal line-break escape, Eclipse shows the error "Invalid escape sequence".

  • `[(\r\n)\n]` does not do what you think it does. It's a single character class that will match `(` or `\r` or `\n` or `)` or `\n`. Listing `\n` twice is meaningless, so writing `[()\n\r]` would mean the same thing. You probably meant to use `\r?\n`, which means `\r\n` or `\n`. – Andreas Sep 16 '16 at 14:06
  • Also, you have way too many parentheses, unless you truly want to capture all that. Without any need for capturing, the regex should be `"@\\d+\r?\n(?:[01]{"+width+"}\r?\n){"+height+"}"`. – Andreas Sep 16 '16 at 14:13
  • Or just use `\R` to denote a line break of any kind. – Holger Sep 16 '16 at 14:14
  • @Holger OP already tried using `\R`, so just saying to use it is not helpful. Now, telling OP to remember to Java-escape the ``\`` in a string literal would have been useful, so: **Maximilian,** remember to write `\\R`, since `\R` is an invalid Java escape sequence, but `"\\R"` becomes `\R` for the regex engine to see, and *it* supports the `\R` escape sequence (in Java 8+). – Andreas Sep 16 '16 at 14:21
  • Thank you! I thought that since (\r\n) was a group it should only be matched as a whole! Sadly this didn't fix my problem. – Maximilian Schirm Sep 16 '16 at 14:22
  • As @Andreas said, I already tried using `\R`, but double-escaping `\\R` seems to really have been the issue, as now it's working as intended! Thank you very much! (I assumed that since it was denoted as `\R` in the API I should use it that way) – Maximilian Schirm Sep 16 '16 at 14:27
  • `\R` is what the regex engine needs to see. If you read the regex text from a file, that's exactly what it should be. When you write it as a Java String Literal, all ``\`` must be escaped (by doubling them), because ``\`` has special meaning. Similarly, any `"` in the regex text must be escaped as `\"`. Some escapes, like `\r` and `\n`, mean the same in a string literal as in regex, so escaping the ``\`` is optional, e.g. `"\r"` and `"\\r"` are the same regex. – Andreas Sep 16 '16 at 14:32
  • That’s the problem when the OP doesn’t show actual source code. So it’s not clear whether we talk about regex syntax or Java string literal syntax. I usually use `"\\R"` when talking about the Java source code syntax and `\R` (without quotes) when talking about the regex syntax (or any other kind of DSL). The real fun starts when talking about generating or parsing source code, i.e. `"\"\\\\R\""` is needed… – Holger Sep 16 '16 at 14:50

1 Answer


Just for completeness, since an escaping problem appeared in your question: \ is special in String literals (the "..." part). Thanks to it we can write characters that are normally not allowed in a String literal, such as line separators, which we can write as \r and \n. Other characters can also be written as a Unicode escape \uXXXX or an octal escape \OOO.
But because \ is special, we also need a way to write the \ symbol itself. Rather than introducing yet another special character for that, we use another \ to escape it, as in "\\". For instance, the "\r\n\\" literal represents 3 characters: carriage return, line feed and \.

That is why, to create a string literal representing \d that we can pass to the regex engine, we need to write it as "\\d".
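The two levels of escaping described above can be checked with a small sketch. The class and method names here are made up for illustration, and Java 8 or later is assumed for the `\R` check.

```java
class LiteralEscapeDemo {
    // Number of characters the compiler stores for the literal "\r\n\\":
    // carriage return, line feed and one backslash.
    static int crLfBackslashLength() {
        return "\r\n\\".length();
    }

    // "\\d+" reaches the regex engine as \d+, i.e. one or more digits.
    static boolean isAllDigits(String input) {
        return input.matches("\\d+");
    }

    public static void main(String[] args) {
        System.out.println(crLfBackslashLength());  // 3
        System.out.println("\\d".length());         // 2: a backslash and 'd'
        System.out.println(isAllDigits("200"));     // true
        System.out.println("\r\n".matches("\\R"));  // true: \R accepts the \r\n pair
    }
}
```

Writing `"\R"` instead of `"\\R"` would not even compile, because `\R` is not an escape sequence the Java compiler knows.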


Now back to the main part of the answer.

[..] is a single character class, so it matches exactly one character from the described set. So:

  • since (..) groups a series of characters, which is not possible inside [..], ( and ) lose their special meaning there. That makes [(\r\n)\n] match a single ( or \r or \n or ) (notice that \r and \n each represent a single line-break character, and the second \n is redundant)

  • since \R, besides a single \r or \n (and a few others), can also represent the \r\n sequence, it can't be used inside [..], because a character class can match only a single character.

    If you use \R inside [..] you will get a PatternSyntaxException: Illegal/unsupported escape sequence. Inside a character class Java allows \\ before a character:

    • to represent predefined character classes: \\d \\w

    but also in cases where it doesn't change anything, like:

    • \\r \\n \\t, where it simply represents the same characters as the String literals "\r" "\n" "\t"
    • or before non-alphabetic characters that have no special meaning there, so escaping them is not really needed: \\@ \\# (escaping an alphabetic character that is not a defined construct, such as \\y, is an error)

    But it will not let you use escapes that have a special meaning outside [...] and are not guaranteed to represent a single character, like \R or \b (a word boundary, which doesn't represent a character but a place before/after a word).
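A short sketch of the points above, assuming Java 8 or later. `CharClassDemo` and its helper methods are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CharClassDemo {
    // True when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True when Pattern.compile refuses the regex.
    static boolean isRejected(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String cls = "[(\r\n)\n]";                   // the class from the question
        System.out.println(matches(cls, "("));       // true: ( is just a member of the set
        System.out.println(matches(cls, "\n"));      // true
        System.out.println(matches(cls, "\r\n"));    // false: a class matches one character only
        System.out.println(matches("\\R", "\r\n"));  // true: \R can consume the \r\n pair
        System.out.println(isRejected("[\\R]"));     // true: \R is not allowed inside [..]
        System.out.println(isRejected("[\\b]"));     // true: same for \b
    }
}
```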

What you can do is use \R instead of [(\r\n)\n] (but don't forget to also escape its \ in the String literal, like you did for \d). You can also remove the outermost (...) pairs, since the entire match is already stored in group 0, so you don't need another group for that purpose.

One of the simplest ways to rewrite your regex would be:

String regex = "@(\\d+)\\R([01]{"+width+"}\\R){"+height+"}";
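As a rough sketch of how this regex could be used with Matcher.find(): the `BlockFinder` class, its `sample()` text and the `findIds` helper are assumptions made for the example, not code from the question. The sample mixes \n and \r\n line endings on purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockFinder {
    // Two blocks as in the question; the second one uses Windows line endings.
    static String sample() {
        return "@200\n0000000000\n0000011001\n1100100000\n0101001101\n1110001110\n\n"
             + "@500\r\n0000000000\r\n0000011001\r\n1100100000\r\n0101001101\r\n1110001110\r\n";
    }

    // Collects the number after '@' (group 1) for every block of the given size.
    static List<String> findIds(String text, int width, int height) {
        String regex = "@(\\d+)\\R([01]{" + width + "}\\R){" + height + "}";
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> ids = new ArrayList<>();
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(findIds(sample(), 10, 5)); // [200, 500]
        System.out.println(findIds(sample(), 9, 5));  // []: the rows are 10 wide, not 9
    }
}
```

Inside the loop, `m.group()` holds the whole block, including the line break after its last row.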

But since you may not want to include the last line separator, feel free to make the last \R optional with the ? quantifier and reluctant by adding another ? after it, like:

String regex = "@(\\d+)\\R([01]{"+width+"}\\R??){"+height+"}";
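The effect of the reluctant \R?? can be seen on a tiny input. This is only a sketch: width and height are hard-coded to 2 to keep it short, and `TrailingBreakDemo` and `firstMatch` are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingBreakDemo {
    // Returns the first match of the regex in text, or null when nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String required = "@(\\d+)\\R([01]{2}\\R){2}";    // every row must end with a line break
        String reluctant = "@(\\d+)\\R([01]{2}\\R??){2}"; // line break is taken only when needed

        String text = "@7\n01\n10\n";
        System.out.println(firstMatch(required, text).endsWith("\n"));   // true
        System.out.println(firstMatch(reluctant, text).endsWith("\n"));  // false

        // Without a line break after the last row only the reluctant version matches.
        System.out.println(firstMatch(required, "@7\n01\n10") == null);  // true
        System.out.println(firstMatch(reluctant, "@7\n01\n10") != null); // true
    }
}
```

The line breaks between rows are still consumed, because the engine backtracks into \R?? as soon as the next row fails to match at a line break.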


Pshemo
  • Nice recap of the comments. Wanna add my last comment too, to explain why `\R` doesn't work unescaped but `\r` does? – Andreas Sep 16 '16 at 14:35
  • @Andreas I am not sure if the OP's problem is a lack of escaping of `\R` via `\\R`. I suspect that OP used `[\\R]` and since `\\R` can't be used there (since it can also represent the `\r\n` pair as mentioned in the answer) he got the `Illegal/unsupported escape sequence` error. It is a similar problem with `[\\b]` since `\b` doesn't represent a single character but a *place* (anchor). Will try to add it to the answer. – Pshemo Sep 16 '16 at 14:40
  • I think the "Invalid escape sequence" was from the `\R` in a string literal, because OP said "Eclipse shows", which means it was a compiler error, while `[\\R]` would be a runtime error. Of course, you also see runtime errors in Eclipse, so you could be right. ;-) – Andreas Sep 16 '16 at 15:25
  • Yes, "Eclipse shows" can come from compilation, but OP could also be talking about a stack trace, so it is kind of vague. But now after reading the comments it looks like you were right about `\R` vs `\\R`, so thanks for pointing that out. – Pshemo Sep 16 '16 at 15:29