I'm trying to understand how the OR operator "|" works in Java regex. I've only just started using regular expressions, and much of it is still unclear to me.

Suppose I wish to match fractions and integers, that is to say things of the form 1/2, 12/25, and also things of the form 13, 235, etc.

I have tried these two patterns:

Pattern pattern1 = Pattern.compile("\\d+|\\d+/\\d+");
Pattern pattern2 = Pattern.compile("\\d+/\\d+|\\d+");

In English, pattern1 says "digits OR digits/digits", whereas pattern2 says "digits/digits OR digits".

Now consider this input string:

String inputStr = "blah... 231/232 blah... 4 blah... 2";

For pattern1, I found these matches:

[junit] found 231
[junit] found 232
[junit] found 4
[junit] found 2

For pattern2, I found these matches:

[junit] found 231/232
[junit] found 4
[junit] found 2

Now the only difference between pattern1 and pattern2 is the order of their alternatives. Of course pattern2 is the one I wanted, as it seems to "prefer" a whole fraction over taking it apart.

So the most important question for me is this: is this behaviour reliable and predictable, or will it differ between platforms?

I'm also curious about a second point. I find this confusing because an "OR" operator should be symmetric in its arguments, like addition. You'd imagine people would be worried if 1+2 and 2+1 carried different semantics. Is there any reason for pattern1 and pattern2 here to be semantically different?

Evan Pu

4 Answers

| isn't just OR; it means "try to match the first alternative, and only if that fails, try the second".

Thus, you want to put the fraction first, since it's the preferred form.
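To see the ordering effect directly, here is a minimal sketch that runs both patterns from the question over its input string. The class name `AlternationOrderDemo` and the helper `findAll` are made up for this illustration; only `Pattern`, `Matcher`, `find()` and `group()` come from `java.util.regex`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Hypothetical helper: collects every successive match of the regex, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "blah... 231/232 blah... 4 blah... 2";
        // Integer alternative first: "231" satisfies it, so the fraction is split.
        System.out.println(findAll("\\d+|\\d+/\\d+", input)); // [231, 232, 4, 2]
        // Fraction alternative first: the whole fraction is tried before a bare integer.
        System.out.println(findAll("\\d+/\\d+|\\d+", input)); // [231/232, 4, 2]
    }
}
```

The outputs match the ones reported in the question, which is what you would expect if the engine tries alternatives strictly from left to right.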

nneonneo
  • For reference, some regexp matchers match `|` greedily, i.e. they take the longest match. In that case you would not have this problem. I think it's especially *regular expression* matchers (as in those that correspond exactly to the regular languages) that do so. – ReyCharles Oct 10 '12 at 21:01

A more useful regex for your purpose would be \\d+(/\\d+)? which requires a group of digits, optionally followed by a slash and a second group of digits.
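As a rough sketch of how the optional-group pattern behaves, the snippet below applies it to the question's input and to a dangling "12/". The class name `OptionalGroupDemo` and the helper `findAll` are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Digits, optionally followed by a slash and more digits (the pattern from this answer).
    static final Pattern NUMBER = Pattern.compile("\\d+(/\\d+)?");

    // Hypothetical helper: collects every successive match of NUMBER in the input.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("blah... 231/232 blah... 4 blah... 2")); // [231/232, 4, 2]
        // A slash with no digits after it is not consumed by the optional group.
        System.out.println(findAll("12/ 7")); // [12, 7]
    }
}
```

Because there is no alternation here, there is no ordering to get wrong: the optional group is greedy, so the fraction part is taken whenever it is present.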

Victor Mukherjee

The alternation operator is like a lazy OR in that it takes the first alternative that can match. There are other posts on the topic that help clarify its behavior:

Java regex alternation operator "|" behavior seems broken

Why order matters in this RegEx with alternation?

In general, most regex engines work this way. The exception is POSIX, which requires the longest match. So portability should not be a concern in Java.

jheddings

The | is called alternation. It lets you list alternatives for a given match, tried from left to right, and it stops at the first alternative that matches. AFAIK, this is very consistent across all Java versions and the programming languages/tools I've used regexes with: Java, Perl, Python, PHP, sed.
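On the symmetry question, one way to look at it: the order of alternatives decides which substring `find()` picks, but it should not change whether a whole string is accepted, because `matches()` backtracks into the second alternative when the first one cannot reach the end of the input. The sketch below illustrates this; the class name `FullMatchDemo` and the helper `matchesWhole` are made up for the example.

```java
import java.util.regex.Pattern;

class FullMatchDemo {
    // Hypothetical helper: true only when the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String intFirst = "\\d+|\\d+/\\d+";
        String fracFirst = "\\d+/\\d+|\\d+";
        for (String s : new String[] {"231/232", "4", "1/"}) {
            // Both orderings give the same answer for each whole-string test.
            System.out.println(s + " -> " + matchesWhole(intFirst, s)
                    + " / " + matchesWhole(fracFirst, s));
        }
    }
}
```

Under this reading, both patterns describe the same set of strings, so | is symmetric in that sense; the asymmetry only shows up in which match is preferred when searching inside a longer text.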

mrk