import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern
                .compile("[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+");
        Matcher matcher = pattern.matcher("i5-2450M");
        // find() locates the first match; group(0) is the whole matched text
        if (matcher.find()) {
            System.out.println(matcher.group(0));
        }
    }
}

I expected this to print i5-2450M, but it actually prints i5.

onemach
jackalope
  • You could include word boundaries in your match. – paddy Aug 21 '12 at 04:50
  • The limit with regex is more often determined by the limitations of the developer, i.e. how much you can easily understand. If you read this code in six months time, how much will be obvious to you? – Peter Lawrey Aug 21 '12 at 07:55

2 Answers


The problem is that the regex engine uses the first alternative that matches at the current position, not the longest one.

In this case the second alternative ([A-Za-z][0-9]{1,}, which matches i5) "shadows" every alternative after it.

// doesn't match
[0-9]{1,}[A-Za-z]{1,}|
// matches "i5"
[A-Za-z][0-9]{1,}|
// the following are never even checked, because of the previous match
[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}|
[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|
[0-9][0-9\\-]{4,}|
[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+

(Note that the regular expression in the post likely has other serious problems: for instance, the last alternative matches 0---#. These should be addressed, but they are not covered below because they are separate from the fundamental issue, which is how alternation behaves.)

To fix this, order the alternatives with the most specific first. Here that means moving the second alternative below all the others, so that the third alternative gets the chance to match all of i5-2450M. (Also review the other alternatives and how they interact; perhaps the entire regular expression can be simplified.)
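As a rough sketch of that reordering (the class name, constant names, and the firstMatch helper are illustrative, not from the question), the six alternatives can be kept as separate strings and joined in a different order. The only change below is that the second alternative moves to the end:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReorderedAlternation {
    // The six alternatives from the question, unchanged (names are hypothetical).
    static final String A1 = "[0-9]{1,}[A-Za-z]{1,}";
    static final String A2 = "[A-Za-z][0-9]{1,}";
    static final String A3 = "[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}";
    static final String A4 = "[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*";
    static final String A5 = "[0-9][0-9\\-]{4,}";
    static final String A6 = "[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+";

    // Original order, as posted in the question.
    static final Pattern ORIGINAL =
            Pattern.compile(String.join("|", A1, A2, A3, A4, A5, A6));
    // Same alternatives, with the short "letter then digits" rule tried last.
    static final Pattern REORDERED =
            Pattern.compile(String.join("|", A1, A3, A4, A5, A6, A2));

    // Returns the first match in the input, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(ORIGINAL, "i5-2450M"));  // i5
        System.out.println(firstMatch(REORDERED, "i5-2450M")); // i5-2450M
        System.out.println(firstMatch(REORDERED, "i5"));       // i5
    }
}
```

The last call shows that the short rule is still reachable when nothing more specific applies. Whether this ordering is right for every input depends on what the pattern is meant to accept, which the question does not say.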

A simple word boundary (\b) will not work here, because - is a non-word character, so a boundary exists between 5 and -. However, depending on what the regular expression is meant to do, anchors (^ and $) could be placed around the alternation, grouped so that they apply to every alternative: e.g. ^(?:existing_regex)$. This doesn't change how alternation behaves, but since end-of-input cannot match immediately after i5, the engine backtracks out of that initial match and goes on to try the subsequent alternatives.
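The difference between the two approaches can be checked with a small sketch. It assumes the whole input is a single token, and it wraps the alternation in a non-capturing group, (?:...), so that \b, ^ and $ apply to every alternative rather than only the first and last. The class and helper names are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredAlternation {
    // The regular expression from the question, unchanged.
    static final String QUESTION_REGEX =
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+";

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "i5-2450M";

        // Word boundaries: \b is satisfied between '5' and '-', so "i5" still wins.
        System.out.println(firstMatch("\\b(?:" + QUESTION_REGEX + ")\\b", input)); // i5

        // Anchors: '$' fails after "i5", forcing the engine on to later alternatives.
        System.out.println(firstMatch("^(?:" + QUESTION_REGEX + ")$", input)); // i5-2450M

        // Matcher.matches() requires the whole input to match, with the same effect.
        System.out.println(Pattern.compile(QUESTION_REGEX).matcher(input).matches()); // true
    }
}
```

Anchoring only helps when the token is the entire input; for tokens embedded in longer text, the alternatives would still need reordering.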


From Java regex alternation operator "|" behavior seems broken:

Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.

(The accepted answer in this question uses word boundaries.)

From Pattern:

The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
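Ordered alternation is easy to see with two literal alternatives. In this minimal sketch (class and method names are illustrative), the same two alternatives give different results depending only on their order:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedAlternationDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The shorter alternative comes first, so it wins even though a longer one fits.
        System.out.println(firstMatch("i5|i5-2450M", "i5-2450M")); // i5
        // With the longer alternative first, the full token is returned.
        System.out.println(firstMatch("i5-2450M|i5", "i5-2450M")); // i5-2450M
    }
}
```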

Community
  • Yes, that's all true, but it still doesn't solve my problem. You only told me why, not how. – jackalope Aug 21 '12 at 05:20
  • @ruby-boy Also consider that such a general regular expression approach may or may not be .. ideal .. based on exact goals/requirements. Here is an *incomplete* list of just Intel [process nomenclatures](http://en.wikipedia.org/wiki/Comparison_of_Intel_processors). –  Aug 21 '12 at 05:44

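Try iterating over all the matches rather than reading only the first one: create the Matcher once with pattern.matcher(text), then call find() in a loop, i.e. while (matcher.find()). Calling pattern.matcher(text) inside the loop condition would build a fresh Matcher on every pass and never advance.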
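A minimal sketch of that loop, using the regular expression from the question (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatches {
    // The regular expression from the question, unchanged.
    static final String QUESTION_REGEX =
            "[0-9]{1,}[A-Za-z]{1,}|[A-Za-z][0-9]{1,}|[a-zA-Z][a-zA-Z0-9\\.\\-_/#]{2,}"
            + "|[0-9]{3,}[A-Za-z][a-zA-Z0-9\\.\\-_/#]*|[0-9][0-9\\-]{4,}"
            + "|[0-9][0-9\\-]{3,}[a-zA-Z0-9\\.\\-_/#]+";

    // Collects every non-overlapping match, in the order find() returns them.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input); // created once, outside the loop
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll(QUESTION_REGEX, "i5-2450M")); // [i5, 2450M]
    }
}
```

Note that for this input the loop yields two separate pieces, i5 and 2450M, not the single token i5-2450M, so iterating alone may not be enough if the whole model name is wanted as one match.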