<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>....

I want to extract everything between <b>Topic1</b> and the next opening <b> tag. In this case that would be: <ul>asdasd</ul><br/>.

Problem: it need not necessarily be the <b> tag; it could be any other repeating tag.

So my question is: how can I extract that text dynamically? The only static things are:

  • The signal keyword to look for is always "Topic1". I'd like to use the tag surrounding it as the one to look for.
  • The tag is always repeated. In this case it's <b>, but it might just as well be <i>, <strong>, <h1>, etc.

I know how to write the Java code, but what would the regex look like?

String regex = ">Topic1<";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}
membersound
  • Obligatory attempt to put you off using regex to parse HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Andy Turner Jan 12 '16 at 16:10
  • @AndyTurner as the content I want to parse might also be corrupt html formatting, I think in this case any java xml parser would fail. So I have to stick to regex. – membersound Jan 12 '16 at 16:15
  • You could do something like `<(\w+)>Topic1<\/\1>` to match different tags. Will it always be followed by another Topic? If not, is the only other scenario the end of the document? – lintmouse Jan 12 '16 at 16:31
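
The backreference idea from the comment above can be tried in isolation. This is only a minimal sketch, assuming the tag around the keyword carries no attributes; the class name `TagNameDemo` and the helper `surroundingTag` are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagNameDemo {
    // <(\w+)> captures the tag name; </\1> requires the closing tag to use the same name.
    private static final Pattern KEYWORD = Pattern.compile("<(\\w+)>Topic1</\\1>");

    // Returns the name of the tag wrapping "Topic1", or null if no such tag is found.
    static String surroundingTag(String html) {
        Matcher m = KEYWORD.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(surroundingTag("<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>"));
        System.out.println(surroundingTag("<h1>Topic1</h1><p>x</p><h1>Topic2</h1>"));
    }
}
```

Because the tag name is captured rather than hard-coded, the same pattern works whether the keyword is wrapped in <b>, <i>, <strong> or <h1>.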

2 Answers

The following should work (written as a Java string literal, so the backreference \1 appears as \\1):

Topic1</(.+?)>(.*?)<\\1>

Input: <b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>

Output: <ul>asdasd</ul><br/>

Code:

    Pattern p = Pattern.compile("Topic1</(.+?)>(.*?)<\\1>");
    //  get a matcher object
    Matcher m = p.matcher("<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>");
    while(m.find()) {
        System.out.println(m.group(2));  // <ul>asdasd</ul><br/>
    }
Martin Konecny
  • Great, that seems close. What if the input string might be distributed over several lines, thus contain linebreaks? How could I make the `*` operator also match anything, whatever it is, even line breaks? – membersound Jan 12 '16 at 16:38
  • In that case use `Pattern.compile("", Pattern.DOTALL);` – Martin Konecny Jan 12 '16 at 16:40
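
Putting the pieces of this thread together, a multi-line version might look like the sketch below. It assumes the tags have no attributes and that the same opening tag appears again after the wanted content; `TopicExtractor` and `extract` are hypothetical names, and the pattern combines the tag-name backreference from the question's comments with the `Pattern.DOTALL` flag mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopicExtractor {
    // Group 1: name of the tag around the keyword.
    // Group 2: everything up to the next opening tag with that same name.
    // DOTALL lets '.' match line terminators, so the content may span several lines.
    private static final Pattern TOPIC = Pattern.compile(
            "<(\\w+)>Topic1</\\1>(.*?)<\\1>", Pattern.DOTALL);

    // Returns the text after the Topic1 tag, or null if the pattern does not match.
    static String extract(String html) {
        Matcher m = TOPIC.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "<b>Topic1</b>\n<ul>asdasd</ul>\n<br/><b>Topic2</b><ul>";
        System.out.println(extract(html));
    }
}
```

Note that `<br/>` is not mistaken for the next `<b>`, because the pattern requires `>` directly after the tag name. If Topic1 can be the last topic in the document, the pattern as written finds nothing and `extract` returns null.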

Try this:

String pattern = "\\<.*?\\>Topic1\\<.*?\\>"; // this will see the tag no matter what tag it is
String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>"; // your string to be split
String[] attributes = text.split(pattern);
for(String atr : attributes) 
{
    System.out.println(atr);
}

Will print out (after an empty first line, because the match sits at the very start of the string):

<ul>asdasd</ul><br/><b>Topic2</b>
mcjcloud