I am trying to extract the value of the href attribute from a line of HTML. For example, given:

<link rel="stylesheet" href="style.css" type="text/css">

I want to get "style.css". Or, given:

<a href="target0.html"><img align="center" src="thumbnails/image001.jpg" width="154" height="99">

I want to get "target0.html"

What would be the correct Java code to do this?

Olcay Ertaş
  • I think the answer to this question is what you are looking for: http://stackoverflow.com/questions/1670593/java-i-have-a-big-string-of-html-and-need-to-extract-the-href-text – DiogoDoreto Nov 21 '11 at 18:29
  • Mandatory SO link: http://stackoverflow.com/questions/1732348 Read the answer with the most upvotes ; ) – TacticalCoder Nov 21 '11 at 19:06

3 Answers

1
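    public static String getHref(String str)
    {
        // Locate the first href attribute that uses a double-quoted value.
        final String marker = "href=\"";
        int startIndex = str.indexOf(marker);
        if (startIndex < 0)
            return "";
        int valueStart = startIndex + marker.length();
        int valueEnd = str.indexOf('"', valueStart);
        // No closing quote: treat the attribute as missing instead of throwing.
        if (valueEnd < 0)
            return "";
        return str.substring(valueStart, valueEnd);
    }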

This method assumes that the HTML is well formed and that the attribute value is double-quoted. It only returns the first href in the string, but it can be extended to handle more.
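One way to extend the same indexOf approach to every href in a string is sketched below. The class name `HrefScanner` and the method `getHrefs` are hypothetical names chosen for illustration. The sketch still assumes double-quoted values, and it would also pick up attributes whose names merely end in "href" (for example `data-href`), so treat it as a starting point rather than a robust parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
final class HrefScanner {
    // Returns the value of every double-quoted href attribute, in document order.
    static List<String> getHrefs(String str) {
        final String marker = "href=\"";
        List<String> hrefs = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = str.indexOf(marker, from);
            if (start < 0) {
                break; // no further href attributes
            }
            int valueStart = start + marker.length();
            int valueEnd = str.indexOf('"', valueStart);
            if (valueEnd < 0) {
                break; // unterminated attribute value: stop scanning
            }
            hrefs.add(str.substring(valueStart, valueEnd));
            from = valueEnd + 1; // continue after the closing quote
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"style.css\" type=\"text/css\">"
                + "<a href=\"target0.html\"><img src=\"thumbnails/image001.jpg\">";
        System.out.println(getHrefs(html)); // prints [style.css, target0.html]
    }
}
```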

aeoliant
1

I realize you asked about using regular expressions, but jsoup makes this simple and is much less error-prone:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HrefExtractor {
    public static void main(final String[] args) {
        // Parse the HTML fragment; jsoup tolerates the unclosed <a> tag.
        final Document document = Jsoup.parse("<a href=\"target0.html\"><img align=\"center\" src=\"thumbnails/image001.jpg\" width=\"154\" height=\"99\">");
        // Select every <a> element that has an href attribute.
        // Use "[href]" instead to match any element with an href, such as <link>.
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            // Print the raw attribute value, here "target0.html".
            System.out.println(element.attr("href"));
        }
    }
}
laz
0

I have not tried the following, but it should be something like this:

    Pattern.compile("<(?:link|a)\\s+[^>]*href=\"(.*?)\"")

However, I'd recommend using one of the available HTML (or even XML) parsers for this task.
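For completeness, here is a minimal sketch of how a pattern of this shape could be applied with `java.util.regex`. The class name `HrefRegex` and the method `findHrefs` are hypothetical. The pattern is a slight variation on the one above: the whitespace escape is written `\\s` because it sits inside a Java string literal, and it is placed after the tag-name group so that both `<link ...>` and `<a ...>` require whitespace after the name. It only handles double-quoted values and is case-sensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class HrefRegex {
    // Group 1 captures the double-quoted href value of a <link> or <a> tag.
    private static final Pattern HREF =
            Pattern.compile("<(?:link|a)\\s+[^>]*href=\"(.*?)\"");

    // Collects the href value from each matching tag, in document order.
    static List<String> findHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"style.css\" type=\"text/css\">\n"
                + "<a href=\"target0.html\"><img align=\"center\" src=\"thumbnails/image001.jpg\">";
        for (String href : findHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

Because the pattern is anchored to the `link` and `a` tag names, the `src` attribute of the `<img>` tag is not reported.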

Nimantha
AlexR