
I've written a Java class which must pull elements out of a string containing newlines. As a first step the code must split the input string by newline and place the results into an array. This is all working fine except in one specific case. I'm using the following code to perform the split:

String lines[] = inputText.split("[\\r?\\n\\r]+");

The issue I'm having is with the following line:

##INFO=<ID=DS,Number=0,Type=Flag,Description=""removed?"">"

It results in two lines:

##INFO=<ID=DS,Number=0,Type=Flag,Description=""removed
"">"

It is splitting on the question mark. Could anyone point me in the right direction as to why this is happening? Inside a regex doesn't a '?' indicate 0 or 1 occurrence? Is this not an acceptable way to split by newline?

Memento Mori
  • does `split('\\n')` not work? – Karthik T Feb 17 '13 at 09:06
  • ? does mean 0 or 1 but not inside a [] group, which then means a literal question mark, hence your strange result – Adam Feb 17 '13 at 09:07
  • It would yes, but I also have to be able to handle Windows style newlines. I thought I was being safe writing the regex like this. But possibly not! – Memento Mori Feb 17 '13 at 09:08
  • @BenShirley you can look at http://stackoverflow.com/questions/247059/is-there-a-newline-constant-defined-in-java-like-environment-newline-in-c to make it platform independent maybe? – Karthik T Feb 17 '13 at 09:09
  • Isn't `[\n\r]+` just enough ? – vadchen Feb 17 '13 at 09:11
  • Thanks for the comments all. I'll try removing the [] group @Karthik T thanks for the link. An issue is that the string is coming from a file which could have been created on a system other than the computer performing the split. – Memento Mori Feb 17 '13 at 09:12
  • You should split on `\r`, `\n`, and `\r\n`. So change your split to: `inputText.split("\r|\r?\n");` – Rohit Jain Feb 17 '13 at 09:12
  • @Rohit Jain Thank you very much. Your suggested split seems to be working perfectly. – Memento Mori Feb 17 '13 at 09:17
  • @Adam Good explanation. I understand what was happening now. Thank you! – Memento Mori Feb 17 '13 at 09:17

3 Answers

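This works: simply split on \r\n OR \n. The Windows line ending is \r\n (carriage return first), so that is the order the pattern has to use.

String manyLines = "line1\nline2\r\nline3?\nline4";
System.out.println(Arrays.asList(manyLines.split("\\r\\n|\\n")));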

Output

[line1, line2, line3?, line4]
Adam

The question mark inside square brackets is a literal question mark. Replace the square brackets with round ones (a character class matches only ONE character per alternative):

String lines[] = inputText.split("(\\r?\\n|\\r)+");

Lines will be split at "\r\n", "\n" and "\r". Because of the trailing +, any run of these characters is consumed as a single delimiter, so this is effectively the same as:

String lines[] = inputText.split("(\\n|\\r)+");

So we can go back to square brackets:

String lines[] = inputText.split("[\\n\\r]+");

If what you actually need is a constant newline depending on OS:

String lines[] = inputText.split("(" + System.getProperty("line.separator") + ")+");
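One consequence of the trailing + in the patterns above is that consecutive line breaks are merged, so blank lines disappear from the result. Whether that is wanted depends on the input format. The sketch below compares the two behaviours; the class and method names are made up for illustration.

```java
import java.util.Arrays;

class BlankLineDemo {
    // A run of CR/LF characters counts as one delimiter, so blank lines are dropped.
    static String[] splitCollapsing(String text) {
        return text.split("[\\n\\r]+");
    }

    // Each single line break is its own delimiter, so blank lines survive as "".
    static String[] splitKeeping(String text) {
        return text.split("\\r?\\n|\\r");
    }

    public static void main(String[] args) {
        String text = "a\r\n\r\nb\nc";
        System.out.println(Arrays.toString(splitCollapsing(text)));
        System.out.println(Arrays.toString(splitKeeping(text)));
    }
}
```

Note that String.split drops trailing empty strings in both cases unless a negative limit is passed as the second argument.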
Hui Zheng

You are using a character class ([]), which matches any one of the characters inside the brackets. In your case, [\\r?\\n\\r]+ matches any of \\r, ?, \\n, \\r (the second \\r is redundant), one or more times (+).
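To see the difference in practice, here is a small sketch that runs the pattern from the question next to an alternation-based one on a string shaped like the problematic line. The class name, method names and sample text are invented for the illustration.

```java
import java.util.Arrays;

class CharClassDemo {
    // Pattern from the question: inside [...], '?' is a literal character,
    // so the text is also split wherever a question mark appears.
    static String[] splitWithCharClass(String text) {
        return text.split("[\\r?\\n\\r]+");
    }

    // Outside a character class, '?' is a quantifier again:
    // an optional CR followed by LF, or a lone CR.
    static String[] splitWithAlternation(String text) {
        return text.split("\\r?\\n|\\r");
    }

    public static void main(String[] args) {
        String text = "Description=\"\"removed?\"\">\"\r\nnext line";
        // Three pieces: the first line is cut in two at the '?'.
        System.out.println(Arrays.toString(splitWithCharClass(text)));
        // Two pieces: the '?' stays inside the first line.
        System.out.println(Arrays.toString(splitWithAlternation(text)));
    }
}
```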

The real portable regex for a newline, defined by Unicode UTS #18: Unicode Regular Expressions, is:

\R => (?:(?>\u000D\u000A)|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

as explained in this answer by Tom Christiansen of Perl fame. Accounting for Java's double escaping (string literal, then regex):

(?:(?>\\u000D\\u000A)|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])
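On Java 8 and later, java.util.regex.Pattern supports the \R linebreak matcher directly, so the long expression does not need to be spelled out by hand. A minimal sketch, assuming a Java 8+ runtime (the class and method names are illustrative only):

```java
import java.util.Arrays;

class LinebreakMatcherDemo {
    // "\\R" in the Java string literal becomes the regex \R,
    // which matches any Unicode linebreak sequence and treats CR LF as one unit.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        // Windows, Unix, old Mac and Unicode LINE SEPARATOR breaks mixed together.
        String text = "a\r\nb\nc\rd\u2028e";
        System.out.println(Arrays.toString(splitLines(text)));
    }
}
```

Unlike the patterns that end in +, this keeps blank lines as empty strings, because each linebreak sequence is matched separately.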
Community
ninjalj