78

So, some way or another (playing around), I found myself with a regex like \d{1}{2}.

Logically, to me, it should mean:

(A digit exactly once) exactly twice, i.e. a digit exactly twice.

But it, in fact, appears to just mean "a digit exactly once" (thus ignoring the {2}).

String regex = "^\\d{1}{2}$"; // ^$ to make those not familiar with 'matches' happy
System.out.println("1".matches(regex)); // true
System.out.println("12".matches(regex)); // false

Similar results can be seen using {n}{m,n} or similar.

Why does this happen? Is it explicitly stated in regex / Java documentation somewhere or is it just a decision Java developers made on-the-fly or is it maybe a bug?

Or is it in fact not ignored and it actually means something else entirely?

Not that it matters much, but it's not across-the-board regex behaviour, Rubular does what I expect.

Note - the title is mainly for searchability for users who want to know how it works (not why).

Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
  • 57
    Your pattern means (a digit exactly once) followed by (nothing exactly twice). – GOTO 0 Sep 23 '13 at 12:09
  • 3
    If it helps, both `pcregrep` and *Mathematica* give errors for this regex like `pcregrep: Error in command-line regex at offset 8: nothing to repeat`. I would either just use `{m*n}`, or I would use `(?:\\d{1}){2}`, which is unambiguous. – Jeremy Sep 23 '13 at 12:10
  • 1
    I don't understand why can't you just use `\d{2}`? Is there any difference in what you are trying to achieve? – Carlos Campderrós Sep 23 '13 at 14:00
  • 5
    @CarlosCampderrós Well, the only thing I'm really trying to achieve is a better understanding of regex. The problem is more theoretical, I'm interested in finding out why it works the way it does as opposed to finding a regex that works for the example. – Bernhard Barker Sep 23 '13 at 14:06
  • @GOTO0 If that is indeed the interpretation, it is very counter-intuitive. You see R{1} is a regex, correct? And so R{1}{2} applies {2} to a regex. It either means (R{1}){2}, or else it must mean R({1}{2}) which is nonsense, because this associativity treats the `{1}` operator as a unit to which the `{2}` operator is applied. That associativity could work if there is semantic a rule for combining these operators. – Kaz Sep 24 '13 at 00:12
  • 2
    @Kaz Not at all: curly repetitions in Java only apply to single nodes (including empty nodes) or groups, not to other repetitions. You can create that pattern and inspect its `matchRoot` with a debugger if you don't believe me. A look at the source code of the method [`Pattern.closure`](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/regex/Pattern.java#3028) will also provide you with some insights. – GOTO 0 Sep 24 '13 at 14:46

7 Answers7

108

IEEE-Standard 1003.1 says:

The behavior of multiple adjacent duplication symbols ( '*' and intervals) produces undefined results.

So every implementation can do as it pleases, just don't rely on anything specific...

piet.t
  • 11,718
  • 21
  • 43
  • 52
  • 1
    +1, but do you know if Java officially complies with this standard? – Bernhard Barker Sep 23 '13 at 12:29
  • 2
    yes because the output result is valid by standard, ie: it can do anything at all. – STT LCU Sep 23 '13 at 12:33
  • 2
    @Dukeling I believe so as well. Notice `System.out.println("".matches("^{1}$"));` returns `true` as well. My bet is that if Java cannot find a valid pattern to repeat, it will repeat `null` instead of throwing an error (which matches anywhere in a string). Also, you used a Ruby based regex tester for Java!? – Jerry Sep 23 '13 at 13:07
  • 3
    @STTLCU Well, there's a difference between officially and not officially complying or not complying. Officially complying means it can be cited as a source, otherwise it's still a nice reference, but doesn't necessarily explain why Java does what it does. – Bernhard Barker Sep 23 '13 at 13:15
  • @Jerry I just used a Ruby-based regex tester as a matter of curiosity to see what other regex engines do, nothing more. – Bernhard Barker Sep 23 '13 at 13:20
  • @Dukeling Oh, okay, out of all the languages out there, I wouldn't have thought of testing it in Java! xD C# and Python would've been my first try. – Jerry Sep 23 '13 at 13:32
  • 1
    There is UB in regex because the specs, in large part, standardize what is out there, and what is out there is software implementations whose regression test suites have coverage holes. You get UB when there are cases nobody has tested, and everyone's code does something else. The spec writers come along, make a note of the situation and say things like "this document does not give a requirement for such and such case". – Kaz Sep 24 '13 at 00:16
  • @Jerry - Good point! But in your pattern, `{1}` is applied to `^`. You can confirm it using `^{0}` - the anchor is ignored: http://fiddle.re/9zhan . Still, you were right, it seems `"{1}"` is a valid pattern in Java. – Kobi Sep 24 '13 at 07:46
  • @Jerry Of course, `matches()` cannot work for that pattern and string. `find` and `replace` show how the pattern is parsed: as if `^{0}` wasn't there at all. – Kobi Sep 24 '13 at 07:55
  • @Jerry - We're probably agreeing but talking about different things... I'm talking about my example: `matches("World")` would return `false` for `"Hello World"`, because it expect a whole match - from start to end, like `^World$`. `find("World")` does return a match. `find("^{0}World")` also returns a match, showing that `{0}` works on `^`. – Kobi Sep 24 '13 at 09:53
  • @Kobi Oops, sorry, I think I was going astray there xD Thanks! – Jerry Sep 24 '13 at 10:15
  • 3
    I'm quite sure this standard is for POSIX BRE and ERE, and it has **nothing** to do with Java regex. Java doesn't even claim to support ERE or BRE! If anything, Unicode Regular Expression http://unicode.org/reports/tr18/ should be cited here. – nhahtdh Apr 21 '15 at 10:18
76

When I input your regex in RegexBuddy using the Java regex syntax, it displays following message

Quantifiers must be preceded by a token that can be repeated «{2}»

Changing the regex to explicitly use a grouping ^(\d{1}){2} solves that error and works as you expect.


I assume that the java regex engine simply neglects the error/expression and works with what has been compiled so far.

Edit

The reference to the IEEE-Standard in @piet.t's answer seems to support that assumption.

Edit 2 (kudos to @fncomp)

For completeness, one would typically use (?:)to avoid capturing the group. The complete regex then becomes ^(?:\d{1}){2}

Community
  • 1
  • 1
Lieven Keersmaekers
  • 57,207
  • 13
  • 112
  • 146
  • If `\d{1}{2}` does not mean `(\d{1}){2}`, then what does it mean? If the associativity is not left to right, then it must be right to left, and so it means `\d({1}{2})`, which is meaningless unless we define what it means to clump two of these braced operators. – Kaz Sep 24 '13 at 00:14
  • @Kaz - OP's test show that the second duplication symbol is not evaluated using Java's regular expression engine. I believe piet.t is right on the mark that every implementation can do as it pleases. – Lieven Keersmaekers Sep 24 '13 at 05:27
  • 5
    Wouldn't `^(:?\d{1}){2}$` be a more precise reproduction of the intent? (In order to avoid capturing.) – fncomp Sep 24 '13 at 19:48
  • 2
    @fncomp - It would, that's also what I used. Small typo though - it should be `(?: )` – Kobi Sep 25 '13 at 10:36
  • @fncomp - I've been dabbling over that myself. It's better performance wise but it's not as concise. As intent goes, the outcome is the same so that didn't bother me. I have added your comment to the answer for completeness. – Lieven Keersmaekers Sep 25 '13 at 11:35
  • @Kobi Ah, I always seem to do that when I don't test my groups. `^(?:\d{1}){2}$` – fncomp Sep 25 '13 at 17:45
10

Scientific approach:
click on the patterns to see the example on regexplanet.com, and click on the green Java button.

  • You've already showed \d{1}{2} matches "1", and doesn't match "12", so we know it isn't interpreted as (?:\d{1}){2}.
  • Still, 1 is a boring number, and {1} might be optimized away, lets try something more interesting:
    \d{2}{3}. This still only matches two characters (not six), {3} is ignored.
  • Ok. There's an easy way to see what a regex engine does. Does it capture?
    Lets try (\d{1})({2}). Oddly, this works. The second group, $2, captures the empty string.
  • So why do we need the first group? How about ({1})? Still works.
  • And just {1}? No problem there.
    It looks like Java is being a little weird here.
  • Great! So {1} is valid. We know Java expands * and + to {0,0x7FFFFFFF} and {1,0x7FFFFFFF}, so will * or + work? No:

    Dangling meta character '+' near index 0
    +
    ^

    The validation must come before * and + are expanded.

I didn't find anything in the spec that explains that, it looks like a quantifier must come at least after a character, brackets, or parentheses.

Most of these patterns are considered invalid by other regex flavors, and for a good reason - they do not make sense.

Community
  • 1
  • 1
Kobi
  • 135,331
  • 41
  • 252
  • 292
4

At first I was surprised this doesn't throw a PatternSyntaxException.

I can't base my answer on any facts, so this is just an educated guess:

"\\d{1}"    // matches a single digit
"\\d{1}{2}" // matches a single digit followed by two empty strings
jlordo
  • 37,490
  • 6
  • 58
  • 83
4

I have never seen the {m}{n} syntax anywhere. It seems that the regex engine on this Rubular page applies the {2} quantifier to the smallest possible token before that - which is \\d{1}. To mimick this in Java (or most other regex engines, it would seem), you need to group the \\d{1} like so:

^(\\d{1}){2}$

See it in action here.

zb226
  • 9,586
  • 6
  • 49
  • 79
4

Compiled structure of the regex

Kobi's answer is spot on about the behavior of Java regex (Sun/Oracle implementation) for the case "^\\d{1}{2}$", or "{1}".

Below is the internal compiled structure of "^\\d{1}{2}$":

^\d{1}{2}$
Begin. \A or default ^
Curly. Greedy quantifier {1,1}
  Ctype. POSIX (US-ASCII): DIGIT
  Node. Accept match
Curly. Greedy quantifier {2,2}
  Slice. (length=0)

  Node. Accept match
Dollar(multiline=false). \Z or default $
java.util.regex.Pattern$LastNode
Node. Accept match

Looking at the source code

From my investigation, the bug is probably due to that fact that { is not properly checked in the private method sequence().

The method sequence() calls to the atom() to parse the atom, then attach quantifier to the atom by calling closure(), and chains all atoms-with-closure together into one sequence.

For example, given this regex:

^\d{4}a(bc|gh)+d*$

Then the top-level call to sequence() will receive the compiled nodes for ^, \d{4}, a, (bc|gh)+, d*, $ and chain them together.

With that idea in mind, let us look at the source code of sequence(), copied from OpenJDK 8-b132 (Oracle uses the same code base):

@SuppressWarnings("fallthrough")
/**
 * Parsing of sequences between alternations.
 */
private Node sequence(Node end) {
    Node head = null;
    Node tail = null;
    Node node = null;
LOOP:
    for (;;) {
        int ch = peek();
        switch (ch) {
        case '(':
            // Because group handles its own closure,
            // we need to treat it differently
            node = group0();
            // Check for comment or flag group
            if (node == null)
                continue;
            if (head == null)
                head = node;
            else
                tail.next = node;
            // Double return: Tail was returned in root
            tail = root;
            continue;
        case '[':
            node = clazz(true);
            break;
        case '\\':
            ch = nextEscaped();
            if (ch == 'p' || ch == 'P') {
                boolean oneLetter = true;
                boolean comp = (ch == 'P');
                ch = next(); // Consume { if present
                if (ch != '{') {
                    unread();
                } else {
                    oneLetter = false;
                }
                node = family(oneLetter, comp);
            } else {
                unread();
                node = atom();
            }
            break;
        case '^':
            next();
            if (has(MULTILINE)) {
                if (has(UNIX_LINES))
                    node = new UnixCaret();
                else
                    node = new Caret();
            } else {
                node = new Begin();
            }
            break;
        case '$':
            next();
            if (has(UNIX_LINES))
                node = new UnixDollar(has(MULTILINE));
            else
                node = new Dollar(has(MULTILINE));
            break;
        case '.':
            next();
            if (has(DOTALL)) {
                node = new All();
            } else {
                if (has(UNIX_LINES))
                    node = new UnixDot();
                else {
                    node = new Dot();
                }
            }
            break;
        case '|':
        case ')':
            break LOOP;
        case ']': // Now interpreting dangling ] and } as literals
        case '}':
            node = atom();
            break;
        case '?':
        case '*':
        case '+':
            next();
            throw error("Dangling meta character '" + ((char)ch) + "'");
        case 0:
            if (cursor >= patternLength) {
                break LOOP;
            }
            // Fall through
        default:
            node = atom();
            break;
        }

        node = closure(node);

        if (head == null) {
            head = tail = node;
        } else {
            tail.next = node;
            tail = node;
        }
    }
    if (head == null) {
        return end;
    }
    tail.next = end;
    root = tail;      //double return
    return head;
}

Take note of the line throw error("Dangling meta character '" + ((char)ch) + "'");. This is where the error is thrown if +, *, ? are dangling and is not part of a preceding token. As you can see, { is not among the cases to throw error. In fact, it is not present in the list of cases in sequence(), and the compilation process will go by default case directly to atom().

@SuppressWarnings("fallthrough")
/**
 * Parse and add a new Single or Slice.
 */
private Node atom() {
    int first = 0;
    int prev = -1;
    boolean hasSupplementary = false;
    int ch = peek();
    for (;;) {
        switch (ch) {
        case '*':
        case '+':
        case '?':
        case '{':
            if (first > 1) {
                cursor = prev;    // Unwind one character
                first--;
            }
            break;
        // Irrelevant cases omitted
        // [...]
        }
        break;
    }
    if (first == 1) {
        return newSingle(buffer[0]);
    } else {
        return newSlice(buffer, first, hasSupplementary);
    }
}

When the process enters atom(), since it encounters { right away, it breaks from switch and for loop, and a new slice with length 0 is created (the length comes from first, which is 0).

When this slice is returned, the quantifier is parsed by closure(), resulting in what we see.

Comparing the source code of Java 1.4.0, Java 5 and Java 8, there doesn't seem to be much changes in the source code of sequence() and atom(). It seems this bug has been there since the beginning.

Standard for regular expression

The top-voted answer citing IEEE-Standard 1003.1 (or POSIX standard) is irrelevant to the discussion, since Java does not implement BRE and ERE.

There are many syntax resulting in undefined behavior according to the standard, but is well-defined behavior across many other regex flavors (though whether they agree or not is another matter). For example, \d is undefined according to the standard, but it matches digits (ASCII/Unicode) in many regex flavors.

Sadly, there is no other standard on regular expression syntax.

There is, however, a standard on Unicode Regular Expression, which focuses on features a Unicode regex engine should have. Java Pattern class more or less implements Level 1 support as described in UTS #18: Unicode Regular Expression and RL2.1 (albeit extremely buggy).

Community
  • 1
  • 1
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
0

I am guessing that in definition of {} is something like "look back to find valid expression (excluding myself - {}", so in your example there is nothing between } and {.

Anyway, if you wrap it in parenthesis it will work as you expected: http://refiddle.co/gv6.

/([0-9]{1}){2}/gim
mousetail
  • 7,009
  • 4
  • 25
  • 45
IProblemFactory
  • 9,551
  • 8
  • 50
  • 66