Troublesome regular expression in Java

Question

I was hoping someone could help me understand why this happens:

    String s = "tbody\n" +"a\n" +"/tbody";
    Pattern p = Pattern.compile("tbody[^(/tbody)]+/tbody"); 

    Matcher m = p.matcher(s);

    while(m.find()){
        System.out.println("found: \n\n"+m.group());            
    }

Output is:

    found: 

    tbody
    a
    /tbody

But if I change the string to `String s = "tbody\n" + "ao\n" + "/tbody";` (I added an o after the a), it prints nothing. Can anyone tell me what I am missing?

I'm using NetBeans 7.4.

`[..]` in a regular expression is a *character class* - now you know the name, look it up :) In any case, consider just using a *non-greedy/lazy quantifier*: `tbody(.*?)/tbody` (you may also be interested in *word boundaries*). — user2864740, Jan 21 '14 at 22:27
You seem to be trying to figure out how to parse HTML with regular expressions. This is a non-starter, since HTML is not a regular language. Please read [this answer](http://stackoverflow.com/a/1732454/18157) — Jim Garrison, Jan 21 '14 at 22:46
@JimGarrison I'm not sure that what I'm trying to do is parsing. I need to collect info from a specific website, which lies between those tags. — user2847339, Jan 22 '14 at 03:17
You'll be much better off if you use a real HTML parser like JSoup — Jim Garrison, Jan 22 '14 at 04:01
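
A minimal sketch of the lazy-quantifier suggestion from the first comment, applied to the strings in the question. The class and method names (`LazyMatchDemo`, `extract`) are made up for illustration. One assumption worth flagging: because the sample input contains newlines, the sketch compiles the pattern with `Pattern.DOTALL`, since `.` does not match line terminators by default. This only illustrates the quantifier; it is not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyMatchDemo {
    // ".*?" is lazy: it grows one character at a time until "/tbody" can match.
    // DOTALL lets '.' match '\n'; without it the match cannot cross a line break.
    static final Pattern P = Pattern.compile("tbody(.*?)/tbody", Pattern.DOTALL);

    // Returns the text captured between each "tbody" ... "/tbody" pair.
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // The input that failed with the character-class pattern now matches.
        for (String body : extract("tbody\n" + "ao\n" + "/tbody")) {
            System.out.println("found: " + body.trim());
        }
    }
}
```

Because the quantifier is lazy, each match stops at the first `/tbody` that follows, so two consecutive blocks in one string are returned as two separate captures.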

score 1 · Accepted Answer · answered Jan 21 '14 at 22:31

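The `[^(/tbody)]` is not what you think it is. It does not mean "any string that is not /tbody". It is a negated character class, which matches one character at a time and excludes each listed character individually: `(`, `/`, `t`, `b`, `o`, `d`, `y` and `)`. Since /tbody contains an o, the o you added is one of the excluded characters, so `[^(/tbody)]+` can no longer cover everything between tbody and /tbody. That's why it does not match any more.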

Try adding an x instead of an o and it will keep working, because x is not among the characters you excluded.
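
The behaviour described above can be checked with a short sketch. It reuses the pattern from the question unchanged; the class and method names (`CharClassDemo`, `found`) are hypothetical and only there to make the example self-contained.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Pattern from the question. [^(/tbody)] excludes the single characters
    // '(', '/', 't', 'b', 'o', 'd', 'y' and ')'; it does not exclude the word "/tbody".
    static final Pattern P = Pattern.compile("tbody[^(/tbody)]+/tbody");

    // True if the pattern matches somewhere in the input.
    static boolean found(String input) {
        return P.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("tbody\n" + "a\n" + "/tbody"));  // true: 'a' is not excluded
        System.out.println(found("tbody\n" + "ao\n" + "/tbody")); // false: 'o' is excluded
        System.out.println(found("tbody\n" + "ax\n" + "/tbody")); // true: 'x' is not excluded
    }
}
```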
