How can I use Java's regex Pattern and Matcher to isolate just the text Q170596? I tried building the expression on regexr.com, but its escape characters don't correspond to Java's.

This is the text I'm trying to parse:

<!-- wikibase-toolbar --><span class="wikibase-toolbar-container"><span class="wikibase-toolbar-item wikibase-toolbar ">[<span class="wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit"><a href="/wiki/Special:SetSiteLink/Q170596">edit</a></span>]</span></span>

I only need to dig out Q170596; the rest can be thrown away.

I guess it would be something like this:

//this is not right
Pattern p = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item wikibase-toolbar \">[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a href=\"/wiki/Special:SetSiteLink/(.*?)\">edit<\/a><\/span>]<\/span><\/span>");

String line;
while ((line = br.readLine()) != null) 
{
    Matcher m = p.matcher(line);
    if( m.matches() ) 
    {
        String first_part    = m.group(1);
        String thing_i_want  = m.group(2);
        String more_crap = m.group(3);
    }
}

I was once told that using regex on HTML is not good style. Is that right? For this task, though, I think it will work, won't it?

smatthewenglish

2 Answers

Pattern p = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
            "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
            "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");

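That's the pattern you want.

Escape the [ and ] as \\[ and \\], because unescaped square brackets start a character class in a regex. Also, don't escape the slashes in the </a> and </span> tags: \/ is not a valid escape sequence in a Java string literal, so the code will not compile.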

You also assumed that .group(1) returns everything before the section you wanted, .group(2) returns the matching area, and .group(3) returns the remainder of the line. This is not how Matcher works.

Each set of ( ) is a group that you can retrieve. If you use one set of ( ), then .group(1) retrieves the text matched by that group.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] argv) {
        // The single pair of parentheses, (.*?), is capturing group 1.
        Pattern p = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
            "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
            "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
        // Sample input; in real code this comes from br.readLine().
        String line = "<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item wikibase-toolbar " +
            "\">[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a href=\"/wiki/Special:SetSiteLink/Q170596\">edit</a></span>]</span></span>";

        Matcher m = p.matcher(line);
        // matches() succeeds only if the whole line fits the pattern.
        if (m.matches()) {
            String first_part = m.group(1);
            System.out.println(first_part);
        }
    }
}
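
To make the group numbering described above concrete, here is a minimal sketch of what the question's code seemed to expect: three explicit groups for the prefix, the ID, and the remainder. The class name `GroupDemo`, the helper `groups`, and the shortened pattern are my own illustrative choices, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Hypothetical pattern with three capturing groups:
    // group 1 = text before "SetSiteLink/", group 2 = the ID, group 3 = the rest.
    private static final Pattern THREE_GROUPS =
            Pattern.compile("(.*)SetSiteLink/(.*?)\"(.*)");

    /** Returns group(0)..group(3) as an array, or null if the whole line does not match. */
    static String[] groups(String line) {
        Matcher m = THREE_GROUPS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // groupCount() excludes group 0, which is always the entire match.
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "<a href=\"/wiki/Special:SetSiteLink/Q170596\">edit</a>";
        String[] g = groups(line);
        for (int i = 0; i < g.length; i++) {
            System.out.println("group(" + i + ") = " + g[i]);
        }
    }
}
```

With three pairs of parentheses, .group(2) is the ID. With the single pair used in the pattern above, only .group(1) exists, and asking for .group(2) throws an IndexOutOfBoundsException.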

Some pointers: I believe this pattern can be much simpler. Try to minimize it so that, for instance, it only checks for the link whose body content is "edit".
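
As a rough sketch of that simplification, the pattern below keeps only the href fragment and uses find() instead of matches(), so the rest of the line does not have to be spelled out. It assumes the ID is always "Q" followed by digits and that only the links you care about use the Special:SetSiteLink/ prefix; both are my assumptions, and the names `SiteLinkIds` and `extractIds` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SiteLinkIds {
    // Assumption: IDs look like "Q" plus one or more digits, followed by the closing quote.
    private static final Pattern ID = Pattern.compile("Special:SetSiteLink/(Q\\d+)\"");

    /** Collects every ID found in the line, in order of appearance. */
    static List<String> extractIds(String line) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(line);
        // find() scans for the next partial match; matches() would require the whole line to fit.
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String line = "<span class=\"wikibase-toolbar-item\"><a href=\"/wiki/Special:SetSiteLink/Q170596\">edit</a></span>";
        for (String id : extractIds(line)) {
            System.out.println(id);
        }
    }
}
```

Because find() is called in a loop, this also copes with a line that contains the link more than once.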

Joeblade
  • but the value for `Q170596` will always change, though everything else remains the same, so I need to allow that to be anything, i think your code has rigidly defined that component, is it so? – smatthewenglish Apr 17 '15 at 11:53
  • No, he is actually right. The part `(.*?)` is the magic you want. – Kraylog Apr 17 '15 at 11:54
  • Also, if the tag you're looking for doesn't get repeated anywhere, I'd minimize the pattern to contain only that. Makes it more readable. – Kraylog Apr 17 '15 at 11:55
  • it does get repeated though, a few times. – smatthewenglish Apr 17 '15 at 11:57
  • what my confusion was before was the assigning of the variable `line` in that example and searching through it, when i changed it to search through the real input line of my data it worked perfectly. – smatthewenglish Apr 17 '15 at 11:58

There's no need for this huge regex. Just do this:

String line = "<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item wikibase-toolbar \">[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a href=\"/wiki/Special:SetSiteLink/Q170596\">edit</a></span>]</span></span>";
Pattern p = Pattern.compile("(.*)<a[^=]*=\"[^\\/]*\\/([^\\/]+\\/)*(.*)\">.*");
Matcher m = p.matcher(line);
if (m.matches()) {
    System.out.println(m.group(3));
}


Farvardin