Java Pattern/Matcher regex on HTML data

Question

How can I use Java's regex Pattern and Matcher to isolate just the text Q170596? I tried building the pattern on regexr.com, but its escape characters don't correspond to Java's.

This is the text I'm trying to parse:

<!-- wikibase-toolbar --><span class="wikibase-toolbar-container"><span class="wikibase-toolbar-item wikibase-toolbar ">[<span class="wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit"><a href="/wiki/Special:SetSiteLink/Q170596">edit</a></span>]</span></span>

I only need to dig out Q170596; the rest can be thrown away.

I guess it would be something like this:

//this is not right
Pattern p = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item wikibase-toolbar \">[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a href=\"/wiki/Special:SetSiteLink/(.*?)\">edit<\/a><\/span>]<\/span><\/span>");

String line;
while ((line = br.readLine()) != null) 
{
    Matcher m = p.matcher(line);
    if( m.matches() ) 
    {
        String first_part    = m.group(1);
        String thing_i_want  = m.group(2);
        String more_crap = m.group(3);
    }
}

I was once told that using regex on HTML is not good style. Is that right? For this task I think it will work, won't it?

Maybe you need to escape backslashes such that you have `\\/` instead of `\/` because java might try to interpret them unlike regexr — Kyborek, Apr 17 '15 at 11:41
This `Pattern p = Pattern.compile("\\[edit\\]");` is working to obtain `Q170596`, isn't it? — Wiktor Stribiżew, Apr 17 '15 at 11:41
Using the whole thing as the pattern, will just match the whole thing. I'm sure that's not what you're going for, is it? — Kraylog, Apr 17 '15 at 11:42
The second thing is: you are using m.group(1) , 2 and 3... but your expression (once it matches) only has 1 set of braces. so only m.group(0) -> the whole matching expression and m.group(1) (the matching part in braces) would exist, I think — Joeblade, Apr 17 '15 at 11:43
You need to be more specific - do you mean you want to isolate the one tag that contains the string `Q170596`? — Kraylog, Apr 17 '15 at 11:44
no i just want to get only the string `Q170596` and nothing else — smatthewenglish, Apr 17 '15 at 11:45
I'm guessing that `Q170596` is an example, and this ID changes every time. Do the rest of the tags text stay the same? — Kraylog, Apr 17 '15 at 11:46
yeah thats exactly right, the value for `Q170596` changes every time and everything else stays the same. — smatthewenglish, Apr 17 '15 at 11:47
@ErkanHaspulat since he wants the actual value, not to find out whether it matches or not — Kraylog, Apr 17 '15 at 11:53

score 2 · Accepted Answer · answered Apr 17 '15 at 11:48

Pattern p = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
            "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
            "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");

That's the pattern you want.

Escape the literal [ and ] as `\\[` and `\\]`, and don't escape the slashes in the closing </a> and </span> tags: `\/` is not a valid escape sequence in a Java string literal, and / needs no escaping in a Java regex.

You assumed that .group(1) gets everything before the section you wanted, .group(2) gets the matching area, and .group(3) gets the remainder of the line. That is not how Matcher works.

Each set of ( ) is a group that you can retrieve. If you use one set of ( ), then .group(1) retrieves the text matched by that group, and .group(0) is the entire match.
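To illustrate the group numbering on something smaller, here is a minimal sketch. The class name `GroupNumbering`, the method `captureId`, and the shortened `/item/` link are made up for this illustration and are not part of the original data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {
    // Hypothetical helper: returns the text captured by the single ( ) pair,
    // or null when the whole input does not match the pattern.
    static String captureId(String input) {
        // Illustrative pattern with exactly one capturing group.
        Pattern p = Pattern.compile("<a href=\"/item/(.*?)\">edit</a>");
        Matcher m = p.matcher(input);
        if (!m.matches()) {
            return null; // matches() requires the entire input to match
        }
        // m.group(0) is the entire match, m.group(1) is the first ( ) pair.
        // m.group(2) would throw IndexOutOfBoundsException here,
        // because the pattern defines only one group.
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(captureId("<a href=\"/item/Q42\">edit</a>"));
    }
}
```

Because matches() has to cover the whole input, any extra text before or after the link makes it return false, which is why the full-line pattern spells out every tag.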

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] argv) {
        // The single ( ) pair captures the ID that follows Special:SetSiteLink/
        Pattern p = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
            "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
            "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
        // Sample input; in real code this is each line read from the reader
        String line = "<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item wikibase-toolbar " +
            "\">[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a href=\"/wiki/Special:SetSiteLink/Q170596\">edit</a></span>]</span></span>";

        Matcher m = p.matcher(line);
        // matches() succeeds only if the whole line matches the pattern
        if (m.matches()) {
            String id = m.group(1); // the text captured by (.*?)
            System.out.println(id);
        }
    }
}

Some pointers: I believe this pattern can be much simpler. Try to minimize it so that, for instance, it only checks the link whose body content is edit.
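One possible way to shrink the pattern is sketched below. It is not the code from this answer: it swaps matches() for find(), which looks for the pattern anywhere in the line, and it assumes the ID is whatever sits between `Special:SetSiteLink/` and the closing quote. The class name `SiteLinkId` and the method `extractId` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SiteLinkId {
    // Assumption: the ID is the run of non-quote characters right after
    // "Special:SetSiteLink/" and it ends at the closing double quote.
    private static final Pattern ID =
        Pattern.compile("Special:SetSiteLink/([^\"]+)\"");

    // Hypothetical helper: returns the captured ID, or null if the line has no such link.
    static String extractId(String line) {
        Matcher m = ID.matcher(line);
        // find() scans for the pattern anywhere in the line,
        // so the surrounding tags do not need to be spelled out.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\">"
            + "<a href=\"/wiki/Special:SetSiteLink/Q170596\">edit</a></span>]</span></span>";
        System.out.println(extractId(line));
    }
}
```

The trade-off is that a shorter pattern is easier to read but also matches any other line that happens to contain a `Special:SetSiteLink/` link, so it only fits if that link appears in no other context in the input.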

but the value for `Q170596` will always change, though everything else remains the same, so I need to allow that to be anything, i think your code has rigidly defined that component, is it so? — smatthewenglish, Apr 17 '15 at 11:53
No, he is actually right. The part `(.*?)` is the magic you want. — Kraylog, Apr 17 '15 at 11:54
Also, if the tag you're looking for doesn't get repeated anywhere, I'd minimize the pattern to contain only that. Makes it more readable. — Kraylog, Apr 17 '15 at 11:55
what my confusion was before was the assigning of the variable `line` in that example and searching through it, when i changed it to search through the real input line of my data it worked perfectly. — smatthewenglish, Apr 17 '15 at 11:58

score 1 · Answer 2 · answered Apr 17 '15 at 11:55

No need for this huge regex! Just do this:

String line = "<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item wikibase-toolbar \">[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a href=\"/wiki/Special:SetSiteLink/Q170596\">edit</a></span>]</span></span>";
Pattern p = Pattern.compile("(.*)<a[^=]*=\"[^\\/]*\\/([^\\/]+\\/)*(.*)\">.*");
Matcher m = p.matcher(line);
if (m.matches()) {
    System.out.println(m.group(3));
}

