
I'm trying to use some very simple regex for a delimiter in the Scanner class. Comparing the two lines below:

Pattern pattern = Pattern.compile("\\r\\n|\\s");

and

Pattern pattern = Pattern.compile("\\s|\\r\\n");

Assuming that '|' acts as the OR operator, it is known that A | B = B | A. What could be the reason I'm getting different results?

Thanks in advance.

The first one gives me:

one
two
three

while the second one gives me:

one

two

three

The file is a text file with:

one[CR][LF]
two[CR][LF]
three

Please check the code I'm using below:

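// Needs java.io.FileInputStream, java.util.Scanner and java.util.regex.Pattern.
String d = "";
Pattern pattern = Pattern.compile("\\s|\\r\\n");

// try-with-resources closes the Scanner and the underlying stream.
try (Scanner sc = new Scanner(new FileInputStream("input/input.txt")).useDelimiter(pattern)) {
    while (sc.hasNext()) {
        d = sc.next();
        System.out.println(d);
    }
}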
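
The behaviour can also be reproduced without the file. The following is only a minimal sketch, assuming the file content is exactly `one\r\ntwo\r\nthree`; the class name `DelimiterOrderDemo` and the helper `tokens` are hypothetical names introduced for illustration. It feeds the same text to a Scanner from a String and collects the tokens for each delimiter pattern:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class DelimiterOrderDemo {
    // Hypothetical helper: splits the input with a Scanner that uses
    // the given regex as its delimiter and returns every token.
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(input).useDelimiter(Pattern.compile(delimiterRegex))) {
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed stand-in for the contents of input/input.txt.
        String input = "one\r\ntwo\r\nthree";
        System.out.println(tokens(input, "\\r\\n|\\s")); // [one, two, three]
        System.out.println(tokens(input, "\\s|\\r\\n")); // [one, , two, , three]
    }
}
```

The empty tokens in the second result correspond to the blank lines in the output shown above: they are the empty strings sitting between each [CR] and [LF].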

I don't believe this is a duplicate of the linked question, for two reasons. First, in that post 'foo' is explicitly contained in 'foobar', whereas \r and \n are not explicitly contained in \s. Second, it is not obvious that C# regex works the same way as Java regex. I checked the linked post before asking here, and it didn't answer my question.

  • 3
    The first alternative will be preferred if it’s found before the second one. – Sebastian Simon Jan 12 '17 at 16:07
  • 1
    What results do you get? Note that `\s|\r\n` pattern is built so that `\r\n` never matches anything since Java regex is a regular NFA and the first alternative in a non-anchored alternation group "wins" and regex stops processing subsequent branches. See [Remember That The Regex Engine Is Eager](http://www.regular-expressions.info/alternation.html). – Wiktor Stribiżew Jan 12 '17 at 16:08
  • @Nelsão Please [edit] your question instead of providing this information in comments. Also, see [this answer in “Why does the order of alternatives matter in regex?”](http://stackoverflow.com/a/18017758/4642212). – Sebastian Simon Jan 12 '17 at 16:13
  • @Xufox: In the article you linked, the order matters because one of the regular expressions is contained in the other. When this doesn't happen, the order shouldn't matter. –  Jan 12 '17 at 16:16
  • What do you mean _gives me_? You've just compiled a `Pattern`, how are you using it? – Sotirios Delimanolis Jan 12 '17 at 16:16
  • @Nelsão This is the case with your regex, too. – Sebastian Simon Jan 12 '17 at 16:18
  • @Xufox: Can you explain why? One of them, '\s', is whitespace. The other, '\r\n', is a line break in the Windows text file format. How can one be included in the other? What am I missing here? –  Jan 12 '17 at 16:23
  • @Wiktor: Your comment uses jargon that I'm not familiar with, like NFA and non-anchored alternation group "wins". I've also checked the article you linked, but like the one from Xufox it only applies when one regex includes the other. So why would \r\n include \s, or vice versa? –  Jan 12 '17 at 16:30
  • 1
    `\s` matches `\r` and `\n`. Thus, `\s|\r\n` will match `\r`, then `\n`, in `text1\r\ntext2`, and `\r\n` won't even be tried. You should swap the alternatives: `"\\r\\n|\\s"`. The most specific pattern, the longer one, should always come first in an unanchored alternation group. "Unanchored" means there is nothing on either side of the alternation pattern. – Wiktor Stribiżew Jan 12 '17 at 16:31
  • Looks like http://stackoverflow.com/questions/18017661/why-does-the-order-of-alternatives-matter-in-regex/18017758#18017758 can be used to close this one as a duplicate. It is the same situation here. – Wiktor Stribiżew Jan 12 '17 at 16:37
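
The point made in the comments, that `\s` already matches `\r` and `\n` individually and so the `\r\n` branch never gets a chance when it comes second, can be checked directly with a `Matcher`. This is only a rough sketch: the class name `AlternationOrderDemo` and the helper `matchLengths` are hypothetical, and the input string is the `text1\r\ntext2` example from the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Hypothetical helper: returns the length of every match of the
    // regex in the input, in the order the matches are found.
    static List<Integer> matchLengths(String regex, String input) {
        List<Integer> lengths = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            lengths.add(m.end() - m.start());
        }
        return lengths;
    }

    public static void main(String[] args) {
        String input = "text1\r\ntext2";
        // \s is tried first and succeeds on \r, then again on \n.
        System.out.println(matchLengths("\\s|\\r\\n", input)); // [1, 1]
        // \r\n is tried first and consumes both characters at once.
        System.out.println(matchLengths("\\r\\n|\\s", input)); // [2]
    }
}
```

With `\s` first, the line break is seen as two separate one-character delimiters, which is why a Scanner reports an empty token between them. With `\r\n` first, it is a single two-character delimiter.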
