Regex to extract link content

Question

I'll be the first to admit that my Regex knowledge is hopeless. I am using Java with the following code:

Matcher m = Pattern.compile(">[^<>]*</a>").matcher(html);
while (m.find()) {
 resp.getWriter().println(html.substring(m.start(), m.end()));
}

I get the following list:

>Link Text a</a>
>Link Text b</a>

What am I missing to remove the > and the </a>?

Cheers.

@Littlejon - Regex+HTML questions aren't very popular these days. (btw, I'm not getting in the middle of this again... the previous one was my most downvoted answer ever. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ) — Kobi, Nov 15 '09 at 09:55
@Kobi - So i have seen. But i am only searching a snippet of HTML. Also tried using DOM without much success. — Littlejon, Nov 15 '09 at 09:58
As an addendum, I am fully aware of the limitations and am fully prepared to point a loaded gun at my own foot :-) — Littlejon, Nov 15 '09 at 10:04

score 2 · Answer 1 · answered Nov 15 '09 at 09:53

Have you looked at using a capturing group?

Pattern.compile(">([^<>]*)</a>")

Note however that it's generally not recommended to use regular expressions for HTML, since HTML isn't regular. You will get more reliable results by using an HTML parser such as JTidy.

Brian Agnew

This answer is also correct. Changing the html.substring(m.start(), m.end()) to m.group(1) makes this work. – Littlejon Nov 15 '09 at 10:12

score 2 · Answer 2 · answered Nov 15 '09 at 09:56

Keep in mind that due to its limited nature, your regex (and regex in general) may run into problems if the HTML you're trying to parse is slightly more complex. For example, the following would fail to parse correctly, but is completely valid (and common) HTML:

<a href="blah.html">this is only a <em>single</em> link</a>

You might be better off using a DOM parser (I'm pretty sure Java has plenty of options in this regard), which you can then ask for the inner text of each <a> tag.

nah, it won't fail, it just won't give you what you expect.. ;) "> link" — falstro, Nov 15 '09 at 09:59
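
To make that concrete, here is a minimal sketch that runs the capturing-group pattern from the other answers against the nested-tag example above. The class name NestedLinkDemo and the helper extract are made up for illustration; only the pattern comes from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedLinkDemo {
    // Hypothetical helper: collects whatever group 1 captures for each match.
    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(">([^<>]*)</a>").matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"blah.html\">this is only a <em>single</em> link</a>";
        // Only the text between the last ">" and "</a>" is captured.
        System.out.println(extract(html)); // prints [ link]
    }
}
```

The match does not fail. It silently returns the tail of the link text, which is arguably harder to notice than an outright failure.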

Bart Kiers · Accepted Answer · 2009-11-15T10:30:10.880

You can do that by wrapping a group around that part of your regex and then using group(X) where X is the number of the group:

Matcher m = Pattern.compile(">([^<>]*)</a>").matcher(html);
while (m.find()) {
 resp.getWriter().println(m.group(1));
}

But a better way would be to use a simple parser for this:

import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {
       Reader reader = new StringReader("foo <a href=\"#\">Link 1</a> bar <a href=\"#\">Link <b>2</b> more</a> baz");
       HTMLEditorKit.Parser parser = new ParserDelegator();
       parser.parse(reader, new LinkParser(), true);
       reader.close();
   }
}

class LinkParser extends HTMLEditorKit.ParserCallback {

    // True while the parser is between an <a> start tag and its </a> end tag.
    private boolean linkStarted = false;
    // Collects the text of the current link, including text inside nested tags.
    private StringBuilder b = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        if (linkStarted) b.append(data);
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) linkStarted = true;
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.A) {
            // End of the link: print what was collected and reset the buffer.
            linkStarted = false;
            System.out.println(b);
            b = new StringBuilder();
        }
    }
}

Output:

Link 1
Link 2 more

Can I find the link i.e '#' instead of Link 1 or Link 2 more ? — Rites, Jan 13 '10 at 09:30
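
One possible way to get the href instead of the link text is to read it from the attribute set passed to handleStartTag. The following is only a sketch built on the same Swing parser as above; the class name HrefDemo and the helper hrefs are invented for the example, and the sample URLs are placeholders.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HrefDemo {
    // Hypothetical helper: returns the href of every <a> start tag, in document order.
    static List<String> hrefs(String html) throws IOException {
        List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    // The attribute is absent (null) for anchors without an href.
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) found.add(href.toString());
                }
            }
        };
        try (StringReader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder input; the file names are made up.
        System.out.println(hrefs("foo <a href=\"a.html\">Link 1</a> bar <a href=\"b.html\">Link <b>2</b> more</a> baz"));
    }
}
```

The same callback could also keep collecting the link text as LinkParser does, if both the URL and the text are needed.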

score 1 · Answer 4 · answered Nov 15 '09 at 10:37

I'm late to the party but I'd like to point out another alternative:

(?<=X)      X, via zero-width positive lookbehind

If you put your initial > into that mess, i.e.

(?<=>)[^<>]*</a>

then it should not be returned as part of your result.

Untested, though. Good luck!
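
For what it's worth, the pattern as written keeps the > out of the match but still includes the trailing </a>, because that part is matched rather than looked at. A sketch of one way around this is to pair the lookbehind with a zero-width positive lookahead, (?=X). The class name LookaroundDemo and the helper linkTexts are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: the whole match is the link text, so no group is needed.
    static List<String> linkTexts(String html) {
        List<String> found = new ArrayList<>();
        // (?<=>) requires a preceding ">", (?=</a>) requires a following "</a>";
        // neither is consumed, so neither shows up in m.group().
        Matcher m = Pattern.compile("(?<=>)[^<>]*(?=</a>)").matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#\">Link Text a</a> <a href=\"#\">Link Text b</a>";
        System.out.println(linkTexts(html)); // prints [Link Text a, Link Text b]
    }
}
```

This has the same limits as the capturing-group version when the link contains nested tags.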

score 0 · Answer 5 · answered Nov 15 '09 at 15:04

A quick way to test your regular expressions is to use a regex editor, such as the following Eclipse plugin: http://brosinski.com/regex/

crowne
