What do I have to modify in this code?

String tags = "<div class='bat'><div id='me'>";
Pattern r = Pattern.compile("<(.*)>",Pattern.CASE_INSENSITIVE| Pattern.MULTILINE | Pattern.DOTALL );

// Now create matcher object.
Matcher m = r.matcher(tags);
while (m.find( )) {
    System.out.println("Found : " + m.groupCount() );
    System.out.println(m.group());   
}

OUTPUT :

Found : 1
<div class='bat'><div id='me'>

But I want this output:

Found: 2
div class='bat'
div id='me'
Vitalii Elenhaupt
Snakox
  • Use a reluctant quantifier `.*?`. – Sotirios Delimanolis Jun 13 '15 at 19:04
  • @SotiriosDelimanolis It's the other way around ;) `*` is greedy and you make it lazy with `?`. Btw you should put that as an answer. – Alexis C. Jun 13 '15 at 19:05
  • @AlexisC. Oh yeah, messed up terminology. Thanks. I don't want to explain how to get the rest of the requested output :| – Sotirios Delimanolis Jun 13 '15 at 19:07
  • Your question looks like you are trying to build an HTML parser. Can't you use an existing one like [jsoup](http://jsoup.org/)? Note that [regex is not a good tool to parse HTML](http://stackoverflow.com/q/701166/1393766). Another mandatory question to read: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348/1393766). – Pshemo Jun 13 '15 at 19:09

3 Answers

One way to do this is with a lookbehind and a lookahead, which assert the angle brackets without including them in the match:

(?<=<)([^>]*)(?=>)

String tags = "<div class='bat'><div id='me'>";
Pattern r = Pattern.compile("(?<=<)([^>]*)(?=>)", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Output :

Found : 1
div class='bat'
Found : 1
div id='me'


Edit: replaced .*? with [^>]* for better performance, as suggested by Pshemo.
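
The snippet above stops at Pattern.compile, so here is a minimal runnable sketch of the same lookaround idea. The class name LookaroundTags and the helper tagContents are made up for illustration, and the capturing group is dropped because with lookarounds the whole match is already the tag content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class LookaroundTags {
    // (?<=<) requires a '<' just before the match and (?=>) a '>' just after it.
    // Neither bracket is consumed, so group() is the text between them.
    static final Pattern TAG = Pattern.compile("(?<=<)[^>]*(?=>)");

    // Collects the content of every <...> in the input, in order of appearance.
    static List<String> tagContents(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String tag : tagContents("<div class='bat'><div id='me'>")) {
            System.out.println(tag);
        }
    }
}
```

No flags are passed here: the pattern contains no letters, anchors, or dots, so CASE_INSENSITIVE, MULTILINE and DOTALL would have no effect on it.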

afzalex
  • Instead of a reluctant quantifier we can get better performance and similar readability with `<[^>]*>` instead of `<.*?>` – Pshemo Jun 13 '15 at 19:11
  • Since you are already using groups, you don't need the look-around mechanism. A simple `<([^>]*)>` and `matcher.group(1)` should do the trick. – Pshemo Jun 13 '15 at 19:18
  • @Pshemo Yes, it could also be done this way, I didn't notice that. – afzalex Jun 13 '15 at 19:22

You have to make the quantifier in your regex non-greedy, and change your code to read capturing group 1 instead of the whole match, like this:

String tags = "<div class='bat'><div id='me'>";
// .*? is the non-greedy (reluctant) quantifier, so each match stops at the first '>'
Pattern r = Pattern.compile("<(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

// Now create matcher object.
Matcher m = r.matcher(tags);
while (m.find()) {
    System.out.println("Found : " + m.groupCount());
    System.out.println(m.group(1)); // group 1 is the text between '<' and '>'
}

However, the code above won't give you 2 groups; it gives you 1 group that is matched 2 times. If you want the content in 2 separate groups, you will have to use the code below:

String tags = "<div class='bat'><div id='me'>";
Pattern r = Pattern.compile("<(.*?)><(.*?)>",Pattern.CASE_INSENSITIVE| Pattern.MULTILINE | Pattern.DOTALL );

// Now create matcher object.
Matcher m = r.matcher(tags);
if (m.find( )) {
    System.out.println("Found : " + m.groupCount() );
    System.out.println(m.group(1));   
    System.out.println(m.group(2));   
}

Federico Piazza

groupCount() does not indicate how many times the pattern matched. It just tells how many capturing groups there are in the regex. If groupCount() returns 2, you know it's safe to access group(1) or group(2), but group(3) will raise an exception.

It makes no sense to call groupCount() inside your while (m.find()) loop, because it never changes. It's a static property of the Pattern object, so you can call it before you make your first match. It's only useful when you don't know what regex is being used, which is fairly rare.
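
To make the difference between the number of groups and the number of matches concrete, here is a small sketch. The class GroupCountDemo and its two helpers are hypothetical names chosen for this example; only Pattern, Matcher, groupCount(), find() and group(int) come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class GroupCountDemo {
    // Number of capturing groups in the regex; no match attempt is needed.
    static int groupsIn(String regex, String input) {
        return Pattern.compile(regex).matcher(input).groupCount();
    }

    // Number of times the regex matches in the input.
    static int matchesIn(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String tags = "<div class='bat'><div id='me'>";
        String regex = "<([^<>]*)>";
        System.out.println("Groups:  " + groupsIn(regex, tags));
        System.out.println("Matches: " + matchesIn(regex, tags));

        // Asking for a group index above groupCount() fails, even after a successful find().
        Matcher m = Pattern.compile(regex).matcher(tags);
        if (m.find()) {
            try {
                m.group(2);
            } catch (IndexOutOfBoundsException e) {
                System.out.println("group(2) does not exist in this pattern");
            }
        }
    }
}
```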

As the other responders have said, your problem is the greediness of the quantifier in (.*), and the solution is to use a non-greedy variant or a negated character class.
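
As a rough side-by-side of the three quantifier choices, the sketch below runs each pattern against the same input and returns what group 1 captured on the first match. The class QuantifierDemo and the helper firstGroup are invented for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class QuantifierDemo {
    // Returns group 1 of the first match, or null if the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tags = "<div class='bat'><div id='me'>";
        // Greedy: .* runs to the end of the input and backtracks to the LAST '>'.
        System.out.println("greedy:    " + firstGroup("<(.*)>", tags));
        // Reluctant: .*? expands one character at a time and stops at the FIRST '>'.
        System.out.println("reluctant: " + firstGroup("<(.*?)>", tags));
        // Negated class: [^<>]* cannot cross a bracket, so no backtracking is involved.
        System.out.println("negated:   " + firstGroup("<([^<>]*)>", tags));
    }
}
```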

String tags = "<div class='bat'><div id='me'>";

Pattern r = Pattern.compile("<([^<>]*)>"); // no modifiers needed
Matcher m = r.matcher(tags);
System.out.printf("Number of groups: %s%n", m.groupCount() );
while (m.find()) {
    System.out.println(m.group(1));   
}

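Notice that I dropped all the option flags. None of them affects this pattern: it contains no letters (CASE_INSENSITIVE), no ^ or $ anchors (MULTILINE), and no dot (DOTALL).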
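
If the goal is the exact output from the question, with the number of matches printed first, one possible approach is to collect the matches into a list and print its size before the items. This is only a sketch; the class TagReport and the methods extract and report are names I made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class TagReport {
    static final Pattern TAG = Pattern.compile("<([^<>]*)>");

    // Collects group 1 of every match, i.e. the text between each '<' and '>'.
    static List<String> extract(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Builds "Found: N" followed by one tag per line.
    static String report(String input) {
        List<String> found = extract(input);
        List<String> lines = new ArrayList<>();
        lines.add("Found: " + found.size());
        lines.addAll(found);
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        System.out.println(report("<div class='bat'><div id='me'>"));
    }
}
```

The count comes from the size of the list, not from groupCount(), which stays at 1 for this pattern no matter how many tags are found.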

Alan Moore