-1

I am trying to use regular expressions in Java to match all strings of the form <b><number></b> that are contained within an <a></a> pair.

e.g. <a> kljsdlk <b>123</b> df <b>345</b> sdfklj</a> should match twice with <b>123</b> and <b>345</b>, while <v> kljsdlk <b>123</b> df <b>345</b> sdfklj</v> should yield no results (because there is no wrapping <a></a>).

The following code is my current best result:

    Pattern MY_PATTERN = Pattern.compile("(<a>.*(<b>[0-9]*<\\\\b>)?.*<\\\\a>)");

    Matcher m = MY_PATTERN.matcher("<a> skdjlkasjflkj <b>200<\\b> sldfhjhfj d lkj b <b>300<\\b> fhih 9 09 <\\a>");
    while (m.find()) {
        for (int i=0; i< m.groupCount() ;i++){
            String s = m.group(i);
            System.out.println(s);
        }
    }

This code results in:

<a> skdjlkasjflkj <b>200<\b> sldfhjhfj d lkj b <b>300<\b> fhih 9 09 <\a>
<a> skdjlkasjflkj <b>200<\b> sldfhjhfj d lkj b <b>300<\b> fhih 9 09 <\a>

I would like it to result in:

<b>200<\b>
<b>300<\b>
summerbulb
  • Do not mix regex with HTML. – hsz Apr 03 '13 at 11:49
  • [Do not try to parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Buggabill Apr 03 '13 at 11:49
  • @StefanBeike - This question has been asked a very large number of times. There are extremely limited instances where parsing markup with regex might possibly be acceptable. You are playing with fire when you do, though. One of the down votes was mine, and my comment links to an explanation that has been used on here many, many times. It is better to use one of any number of available libraries to do the duty. – Buggabill Apr 03 '13 at 12:03

3 Answers

1

Why not match for <a>.*</a> first, and then look for <b>[0-9]*</b>?

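    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Step 1: capture everything between <a> and </a> (greedy).
    Pattern p1 = Pattern.compile("(<a>.*</a>)");
    // Step 2: find each <b>digits</b> inside that captured text.
    Pattern p2 = Pattern.compile("<b>\\d*</b>");
    Matcher m1 = p1.matcher("<a> kljsdlk <b>123</b> df <b>345</b> sdfklj</a>");
    if (m1.find()) {
      Matcher m2 = p2.matcher(m1.group());
      while (m2.find()) {
        System.out.println(m2.group());
      }
    }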
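
If the input can contain more than one `<a>...</a>` block, the greedy `.*` above would span from the first `<a>` to the last `</a>` and also pick up any `<b>` tags sitting between blocks. A minimal sketch of one way around that, assuming the `<a>` tags are never nested: use a reluctant `.*?` and loop over every outer match. The class name `InnerTagMatcher` and method `findInner` are made up for illustration, and `\\d+` is used instead of `\\d*` on the assumption that an empty `<b></b>` should not count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InnerTagMatcher {
    // Reluctant .*? stops at the first </a>, so each <a>...</a> block is
    // matched on its own. DOTALL lets the block span several lines.
    private static final Pattern OUTER = Pattern.compile("<a>(.*?)</a>", Pattern.DOTALL);
    // Assumption: at least one digit is required between <b> and </b>.
    private static final Pattern INNER = Pattern.compile("<b>\\d+</b>");

    // Hypothetical helper: returns every <b>number</b> found inside some <a>...</a>.
    static List<String> findInner(String input) {
        List<String> result = new ArrayList<>();
        Matcher outer = OUTER.matcher(input);
        while (outer.find()) {
            // Search only the text captured between <a> and </a>.
            Matcher inner = INNER.matcher(outer.group(1));
            while (inner.find()) {
                result.add(inner.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findInner("<a> x <b>123</b> y <b>345</b></a> <b>9</b> <a><b>7</b></a>"));
        System.out.println(findInner("<v> x <b>123</b> y <b>345</b></v>"));
    }
}
```

This still treats the markup as flat text, so nested or malformed tags would need a real parser.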
devnull
  • but that is exactly what @summerbulb wanted. – YaOg Apr 03 '13 at 12:23
  • @YaOg The greediness of `*` works for this trivial example (how often does a trivial example match the actual environment?!), but if there is more than one `...` in the target string, then there will be issues, depending on the goal. – Kenneth K. Apr 03 '13 at 13:29
0

The problem is that your pattern matches the whole string. You should create a pattern for the inner tag only; that will return the matching strings you want.

MozenRath
0

If Java supported arbitrary-length lookbehinds, then you might be able to do this. Without that, this won't be possible in regex alone. Besides, since it's HTML/XML, a library dedicated to parsing it would be more intuitive to use.
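
As a rough illustration of the library route, here is a sketch using the DOM parser that ships with the JDK (`javax.xml.parsers`). It assumes the input is well-formed XML with proper `</b>` and `</a>` closing tags, so the backslash-style `<\b>` in the question's test string would be rejected. The class name `InnerTagParser`, the helper names, and the wrapping `<root>` element are made up for this example. It returns the numbers themselves rather than the surrounding `<b>` markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class InnerTagParser {
    // Hypothetical helper: text of every <b> that holds only digits and
    // has an <a> element somewhere among its ancestors.
    static List<String> numbersInsideA(String fragment) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption: wrap the fragment in a made-up <root> element so that
            // input with several top-level elements still parses.
            doc = builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        List<String> result = new ArrayList<>();
        NodeList bs = doc.getElementsByTagName("b");
        for (int i = 0; i < bs.getLength(); i++) {
            Node b = bs.item(i);
            String text = b.getTextContent().trim();
            if (text.matches("\\d+") && hasAncestor(b, "a")) {
                result.add(text);
            }
        }
        return result;
    }

    // Walks up the tree looking for an ancestor element with the given name.
    static boolean hasAncestor(Node node, String name) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (name.equals(p.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(numbersInsideA("<a> kljsdlk <b>123</b> df <b>345</b> sdfklj</a>"));
        System.out.println(numbersInsideA("<v> kljsdlk <b>123</b> df <b>345</b> sdfklj</v>"));
    }
}
```

For real-world HTML, which is often not well-formed XML, a dedicated HTML parsing library would likely be a better fit than the strict XML parser used here.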

Kenneth K.