5

I'm trying to use Java regexps to match a pattern that spans multiple lines. The pattern has one line that starts with 'A' followed by exactly 50 characters and then one or more lines that start with 'B' followed by exactly 50 characters:

A...    //  exactly 50 chars after the A
B...
B...

Java regular expressions don't seem to support this, however.

Here is a regexp that works for one A and one B line:

A.{50}[\\n[\\n\\r]]B.{50}[\\n[\\n\\r]]

Here is the same regexp modified to find one or more B lines:

A.{50}[\\n[\\n\\r]][B.{50}[\\n[\\n\\r]]]+

This regexp only finds the leading B character on the first B line, however.

I use [\\n[\\r\\n]] to handle both DOS and UNIX newlines. Turning on MULTILINE mode doesn't affect the results.

The problem seems to be when I use the brackets with '+' to turn the regexp for a B line into a character class that can capture multiple lines.

Is there something about Java regexps that doesn't allow the '.' character or the curly brackets to specify an exact line length?

Thariama
Dean Schulze
  • There is one A line with 50 chars following the 'A', then multiple B lines with 50 chars following the leading 'B'. stackoverflow didn't preserve the newlines between the A and B lines that I showed above. – Dean Schulze Nov 22 '10 at 17:57

6 Answers

0

To handle both Unix- and DOS-style newlines, you can use:

\\r?\\n

Also, your way of grouping one or more B lines is incorrect: [] defines a character class, which matches a single character from the listed set, so it cannot be used for grouping. Use a non-capturing group, (?: ), instead.

Try this regex:

A.{50}\\r?\\n(?:B.{50}(?:\\r?\\n)?)+

Regex tested here
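To make the difference concrete, here is a minimal sketch that runs the bracketed pattern from the question and the grouped pattern from this answer against the same input. The class name, the line helper and the 'x' filler data are my own assumptions for illustration; only the two regexes come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {
    // Pattern from the question: the square brackets around the B part form a character class.
    static final String BRACKETED = "A.{50}[\\n[\\n\\r]][B.{50}[\\n[\\n\\r]]]+";
    // Pattern from this answer: (?: ) groups the whole B line so that + repeats it.
    static final String GROUPED = "A.{50}\\r?\\n(?:B.{50}(?:\\r?\\n)?)+";

    // Hypothetical test data: a prefix letter followed by exactly 50 'x' characters.
    static String line(char prefix) {
        StringBuilder sb = new StringBuilder().append(prefix);
        for (int i = 0; i < 50; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    // One A line followed by two B lines, all terminated by a Unix newline.
    static final String INPUT = line('A') + "\n" + line('B') + "\n" + line('B') + "\n";

    // Returns the text of the first match, or null if the regex does not match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String bracketed = firstMatch(BRACKETED, INPUT);
        String grouped = firstMatch(GROUPED, INPUT);
        System.out.println("bracketed match length: " + bracketed.length());
        System.out.println("grouped match length:   " + grouped.length());
        System.out.println("grouped covers whole input: " + grouped.equals(INPUT));
    }
}
```

With this input the bracketed pattern should stop right after the first 'B', because 'x' is not one of the characters listed in the class, while the grouped pattern should cover all three lines.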

codaddict
  • Just for the sake of it since you posted a ruby version. Here's a great python version of a regex tester http://www.pythonregex.com/ – Falmarri Nov 22 '10 at 18:38
0

In the following regex:

(A[^\r\n]{50}(\r\n|\n))(B[^\r\n]{50}(\r\n|\n))+

I used [^\r\n] to match any character that is not \r or \n. You can replace it with [\d] if you have digits, for example.

See http://www.myregextester.com/?r=b7c3ca56

In the example, the regex matches all except the last line.
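The linked example is not reproduced here, so the following is only a sketch under my own assumption: that the last line is skipped because it has no trailing line break, which this regex requires after every B line. The class name, helper and filler data are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingNewlineDemo {
    // The regex from this answer, written as a Java string literal.
    static final String REGEX = "(A[^\\r\\n]{50}(\\r\\n|\\n))(B[^\\r\\n]{50}(\\r\\n|\\n))+";

    // Hypothetical test data: a prefix letter followed by exactly 50 'x' characters.
    static String line(char prefix) {
        StringBuilder sb = new StringBuilder().append(prefix);
        for (int i = 0; i < 50; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    // Assumed input shape: the final B line is not terminated by a line break.
    static final String NO_TRAILING = line('A') + "\n" + line('B') + "\n" + line('B');
    static final String WITH_TRAILING = NO_TRAILING + "\n";

    // Returns the end offset of the first match, or -1 if there is no match.
    static int matchEnd(String input) {
        Matcher m = Pattern.compile(REGEX).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        System.out.println("input length without trailing newline: " + NO_TRAILING.length());
        System.out.println("match end without trailing newline:    " + matchEnd(NO_TRAILING));
        System.out.println("input length with trailing newline:    " + WITH_TRAILING.length());
        System.out.println("match end with trailing newline:       " + matchEnd(WITH_TRAILING));
    }
}
```

If the last B line must be included even without a line break, the final (\r\n|\n) could be made optional, as the other answers do in various ways.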

True Soft
0

This should work:

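import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "A1234567890\nA12345678\nA12345678\nB12345678\nA123456\nA1234567\nZA12345678\nB12345678\nA12345678\nB12345678\nB12345678\nB12345678\nB1234567\nA12345678\nB12345678";

// One A line, then one or more B lines; the last group consumes the final line break or the end of input.
String regex = "^A.{8}$((\\r|\\r\\n|\\n)^B.{8}$)+(\\r|\\r\\n|\\n|\\z)";

// MULTILINE lets ^ and $ match at every line boundary, not only at the ends of the input.
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(input);

while (matcher.find()) {
    System.out.println("matches from " + matcher.start() + " to " + matcher.end());
}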

Note:

  1. the use of ^, $ and MULTILINE to avoid matching the line starting with "ZA".
  2. the use of (\\r|\\r\\n|\\n) to match Unix, Windows and old Mac OS line endings.
  3. the use of (\\r|\\r\\n|\\n|\\z) to match the last B line when it has no end-of-line.

Oops, I used 8 instead of 50 to improve readability.

andcoz
0

The dot and the curly brackets work fine; it's the rest of your regex that's wrong. Check this out:

Pattern p = Pattern.compile("^A.{50}(?:(?:\r\n|[\r\n])B.{50})+$");

(?:\r\n|[\r\n]) matches a CRLF sequence, CR only, or LF only. (I could have used two backslashes each like you did, but this works too).

If you're using the regex to pluck matches out of some larger text, you'll want to compile it in MULTILINE mode so the ^ and $ anchors can match at line boundaries. If it's supposed to match a whole string, leave it in the default mode so they only match at the beginning and end of the string.
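Here is a small sketch of that difference between the two modes, using the regex above unchanged. The class name, the helper and the sample strings ("header", "trailer", 'x' filler) are assumptions made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorModeDemo {
    // The regex from this answer, with real CR and LF characters embedded in the string.
    static final String REGEX = "^A.{50}(?:(?:\r\n|[\r\n])B.{50})+$";

    // Hypothetical test data: a prefix letter followed by exactly 50 'x' characters.
    static String line(char prefix) {
        StringBuilder sb = new StringBuilder().append(prefix);
        for (int i = 0; i < 50; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    // A string that consists of nothing but one A line and two B lines.
    static final String BLOCK = line('A') + "\n" + line('B') + "\r\n" + line('B');
    // The same block appearing twice inside a larger text.
    static final String EMBEDDED = "header\n" + BLOCK + "\ntrailer\n" + BLOCK + "\n";

    // Counts how many times the regex is found in the input with the given flags.
    static int countMatches(String input, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("whole string, default mode: " + countMatches(BLOCK, 0));
        System.out.println("larger text, default mode:  " + countMatches(EMBEDDED, 0));
        System.out.println("larger text, MULTILINE:     " + countMatches(EMBEDDED, Pattern.MULTILINE));
    }
}
```

In default mode ^ can only match at the very start of the input, so nothing should be found in the larger text until MULTILINE is turned on.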

Alan Moore
0

The correct way to match a linebreak sequence is:

"(?:(?>\\u000D\\u000A)|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])"

That’s in Java’s slackbashy string notation, of course, just as you might pass to Pattern.compile. More reasonable languages allow you to get by with simply this:

(?:(?>\x0D\x0A)|\v)

But then, Java’s regexes have never been anything like reasonable, and even this is actually a gross understatement for how bad they really are. Java’s poor support for whitespace detection is just one of its regexes’ innumerable trouble-spots.

Good luck: you’ll need it. ☹
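As a rough illustration, the sketch below splits a sample string on the explicit linebreak pattern. It also tries \R, the linebreak matcher that java.util.regex.Pattern gained in Java 8; that is newer than this answer, so treat it as an aside that assumes a Java 8+ runtime. The class name and the sample text are made up.

```java
import java.util.regex.Pattern;

class LinebreakDemo {
    // Explicit linebreak pattern: CRLF as one atomic unit, or any single linebreak character.
    static final String EXPLICIT =
            "(?:(?>\\u000D\\u000A)|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";
    // Built-in linebreak matcher; assumes Java 8 or later.
    static final String BUILT_IN = "\\R";

    // Hypothetical sample mixing CRLF, LF, CR and the Unicode line separator.
    static final String TEXT = "one\r\ntwo\nthree\rfour\u2028five";

    // Splits the text on every occurrence of the given linebreak regex.
    static String[] splitLines(String regex, String text) {
        return Pattern.compile(regex).split(text);
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", splitLines(EXPLICIT, TEXT)));
        System.out.println(String.join(",", splitLines(BUILT_IN, TEXT)));
    }
}
```

Because CRLF is tried first and as a unit, the split should not produce an empty string between "one" and "two".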

tchrist
0

This should work too:

Pattern regex = Pattern.compile("^A.{50}$\\s+(?:^B.{50}$\\s*)+(?:^|\\z)", Pattern.MULTILINE);

The reasoning behind this is that ^ matches at the start of the line, $ matches at the end of the line, before an (optional) newline character, and \s matches whitespace which includes \r and \n. Since we're using it between $ and ^, it can only match newline characters and not other whitespace.

The (?:^|\\z) is used to make sure that we don't accidentally match any leading spaces in the line following the last repetition of the B line. If lines never start with whitespace, you can drop this bit.
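To show what the trailing (?:^|\\z) changes, here is a sketch that compares the regex with and without it on input where the line after the last B line starts with two spaces. The class name, helper, the WITHOUT_GUARD variant and the sample input are assumptions for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingGuardDemo {
    // The regex from this answer.
    static final String WITH_GUARD = "^A.{50}$\\s+(?:^B.{50}$\\s*)+(?:^|\\z)";
    // Hypothetical variant with the final (?:^|\\z) removed, for comparison only.
    static final String WITHOUT_GUARD = "^A.{50}$\\s+(?:^B.{50}$\\s*)+";

    // Hypothetical test data: a prefix letter followed by exactly 50 'x' characters.
    static String line(char prefix) {
        StringBuilder sb = new StringBuilder().append(prefix);
        for (int i = 0; i < 50; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    // One A line and two B lines, followed by a line that begins with two spaces.
    static final String INPUT =
            line('A') + "\n" + line('B') + "\r\n" + line('B') + "\n" + "  next line";

    // Returns the end offset of the first MULTILINE match, or -1 if there is none.
    static int matchEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        int guarded = matchEnd(WITH_GUARD, INPUT);
        int unguarded = matchEnd(WITHOUT_GUARD, INPUT);
        System.out.println("end with guard:    " + guarded);
        System.out.println("end without guard: " + unguarded);
        System.out.println("text after guarded match: \"" + INPUT.substring(guarded) + "\"");
    }
}
```

Without the guard, the greedy \s* should swallow the two leading spaces of the next line; with it, the match should end right after the last line break.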

Tim Pietzcker