Extract between html tag with unknown tagname?

Question

<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>....

I want to extract everything that comes between Topic1 and the next <b> starting tag, which in this case would be: <ul>asdasd</ul><br/>.

Problem: it need not necessarily be the <b> tag; it could be any other repeating tag.

So my question is: how can I extract that text dynamically? The only fixed things are:

The signal keyword to look for is always "Topic1". I'd like to use the tags surrounding it as the ones to look for.
The tag is always repeated. In this case it's always <b>, but it might just as well be another tag such as <h1>.

I know how to write the java code, but what would the regex be like?

String regex = ">Topic1<";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}

Obligatory attempt to put you off using regex to parse HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Andy Turner, Jan 12 '16 at 16:10
@AndyTurner as the content I want to parse might also be corrupt html formatting, I think in this case any java xml parser would fail. So I have to stick to regex. — membersound, Jan 12 '16 at 16:15
You could do something like `<(\w+)>Topic1<\/\1>` to match different tags. Will it always be followed by another Topic? If not, is the only other scenario the end of the document? — lintmouse, Jan 12 '16 at 16:31

score 2 · Accepted Answer · answered Jan 12 '16 at 16:31

The following should work

Topic1</(.+?)>(.*?)<\1>

Input: <b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>

Output: <ul>asdasd</ul><br/>

Code:

    Pattern p = Pattern.compile("Topic1</(.+?)>(.*?)<\\1>");
    //  get a matcher object
    Matcher m = p.matcher("<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b><ul>");
    while(m.find()) {
        System.out.println(m.group(2));  // <ul>asdasd</ul><br/>
    }

Martin Konecny

Great, that seems close. What if the input string is spread over several lines and thus contains line breaks? How could I make `.*` match anything, including line breaks? – membersound Jan 12 '16 at 16:38

In that case use `Pattern.compile(regex, Pattern.DOTALL);` so that `.` also matches line terminators. – Martin Konecny Jan 12 '16 at 16:40
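
Putting the accepted pattern and the DOTALL hint together, a minimal sketch could look like the following. The class name `TopicExtractor` and the helper `extractAfter` are made up for illustration and are not part of the answer. It also assumes the tag around the keyword carries no attributes, and it uses `(\w+)` (as suggested in the comments on the question) so that the closing and the next opening tag must share the same name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopicExtractor {

    // Hypothetical helper: returns the text between "<tag>keyword</tag>"
    // and the next "<tag>", or an empty Optional if there is no such match.
    // Assumption: the surrounding tag has no attributes.
    static Optional<String> extractAfter(String text, String keyword) {
        // Group 1 captures the unknown tag name; \1 requires the same name again.
        // Pattern.quote treats the keyword as a literal string.
        String regex = "<(\\w+)>" + Pattern.quote(keyword) + "</\\1>(.*?)<\\1>";
        // DOTALL makes '.' match line terminators too.
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String text = "<b>Topic1</b><ul>\nasdasd\n</ul><br/>\n<b>Topic2</b><ul>";
        System.out.println(extractAfter(text, "Topic1").orElse("(no match)"));
    }
}
```

Note that this sketch finds nothing when the keyword's tag is not repeated afterwards, for example when Topic1 is the last topic in the document.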

score 0 · Answer 2 · answered Jan 12 '16 at 16:31

Try this

String pattern = "\\<.*?\\>Topic1\\<.*?\\>"; // this will see the tag no matter what tag it is
String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>"; // your string to be split
String[] attributes = text.split(pattern);
for(String atr : attributes) 
{
    System.out.println(atr);
}

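Will print out (preceded by an empty line, because the match at the very start of the string leaves an empty first element in the array):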

<ul>asdasd</ul><br/><b>Topic2</b>
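
To make the shape of the `split` result easier to see, here is a small sketch. The class `SplitDemo` and method `splitOnTopic` are hypothetical names, and the pattern is a slight variation of the one above: it uses `[^>]*` instead of `.*?` on the assumption that a tag never contains `>`, so the match cannot start at an earlier tag and run across other markup. As the output shows, `split` returns everything after the Topic1 tag and does not stop at the next `<b>`.

```java
import java.util.Arrays;

class SplitDemo {

    // Hypothetical helper: splits the text at "<anytag>Topic1<anytag>".
    // Assumption: tags contain no '>' characters inside them.
    static String[] splitOnTopic(String text) {
        return text.split("<[^>]*>Topic1<[^>]*>");
    }

    public static void main(String[] args) {
        String text = "<b>Topic1</b><ul>asdasd</ul><br/><b>Topic2</b>";
        String[] parts = splitOnTopic(text);
        // parts[0] is empty because the match sits at index 0 of the input.
        System.out.println(Arrays.toString(parts));
        // prints: [, <ul>asdasd</ul><br/><b>Topic2</b>]
    }
}
```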
