Get only href content and src content from html

Question

I am wondering how to extract only the href and src values from HTML content. I tried regular expressions but could not get them to work.

This is the text that I want to get href and src content from:

<a href="http://rdmobile.fr/blog/mobile-la-pub-consomme-plus-que-les-applications-elles-memes/"><img align="left" hspace="5" width="150" height="150" src="http://rdmobile.fr/blog/wp-content/uploads/2012/03/angry-birds-150x150.jpg" class="alignleft tfe wp-post-image" alt="angry-birds" title="angry-birds" /></a>Si vous aussi vous vous étonnez de voir votre batterie fondre comme neige au soleil dès lors que jouez à Angry Birds, rassurez-vous, c’est normal. Des chercheurs de l&#8217;université de Purdue se sont intéressés aux publicités destinées majoritairement aux applications gratuites, et oui, comment les développeurs mangent-ils autrement ? Plus sérieusement, cette étude, publiée sur le [...]

I want to extract data like this.

href content: http://rdmobile.fr/blog/mobile-la-pub-consomme-plus-que-les-applications-elles-memes/
src content: http://rdmobile.fr/blog/wp-content/uploads/2012/03/angry-birds-150x150.jpg

Can anyone help me with this? I would also like to learn basic regular expressions.

Thanks, Isuru

score 2 · Answer 1 · answered Apr 11 '13 at 12:45

An HTML parser like JSoup is great for this type of problem, and allows for straightforward interaction with the document using CSS-style selectors:

// Fetch and parse the page; for HTML already held in a String, use Jsoup.parse(html) instead.
Document document = Jsoup.connect(url).get();
Elements elementsWithSrcAttributes = document.select("[src]");
Elements elementsWithHrefAttributes = document.select("[href]");

for (Element element: elementsWithSrcAttributes) {
    System.out.println("src content: " + element.attr("src"));
}

for (Element element: elementsWithHrefAttributes) {
    System.out.println("href content: " + element.attr("href"));
}

score 0 · Answer 2 · answered Apr 11 '13 at 12:32


You could parse the content using an XML parser.

See "Parsing XML Data".
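As a rough illustration of the XML-parser idea, here is a minimal sketch that uses the JDK's streaming StAX reader (javax.xml.stream) rather than any Android-specific parser. It assumes the input is well-formed XML with a single root element, which real-world HTML often is not; the class name, the extract helper, and the example.com URLs are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxAttributeExtractor {

    // Hypothetical helper: collects "name=value" for every href or src
    // attribute, in document order. Assumes well-formed XML input.
    static List<String> extract(String xml) throws XMLStreamException {
        List<String> found = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // Only start tags carry attributes; skip every other event.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        String name = reader.getAttributeLocalName(i);
                        if (name.equals("href") || name.equals("src")) {
                            found.add(name + "=" + reader.getAttributeValue(i));
                        }
                    }
                }
            }
        } finally {
            reader.close();
        }
        return found;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Placeholder fragment, wrapped in a <div> so it has a single root element.
        String fragment = "<div><a href=\"http://example.com/post/\">"
                + "<img src=\"http://example.com/image.jpg\" alt=\"demo\" /></a>Some text</div>";
        extract(fragment).forEach(System.out::println);
    }
}
```

If the markup is not well-formed (unclosed tags, unquoted attributes), the reader throws an XMLStreamException, so this only suits input you know to be XHTML-like.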


Nicolas Dusart


score 0 · Answer 3 · edited May 23 '17 at 11:44

You don't want to use regular expressions for that. Just... just don't. Bad things happen.

What you want to use is XPath. For a given HTML document, the //a/@href XPath expression returns the href attribute of every a element, and //img/@src does the same for the src attribute of img elements. (A leading single slash, as in /a/@href, would match only an a element at the document root.) Think of it as regular expressions for XML.

The hard part isn't XPath, which is relatively straightforward, but obtaining a valid DOM from an HTML file. I'd recommend CyberNeko, but have no idea whether that's compatible with your Android requirement.
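To make the XPath idea concrete, here is a minimal sketch that uses only the JDK's built-in javax.xml APIs. It sidesteps the "valid DOM" problem by assuming the fragment is already well-formed XML; for real HTML you would first need a tolerant parser. The class name, the evaluate helper, and the example.com URLs are made up for the example, and the expressions use the descendant axis (//) so that elements match at any depth.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathAttributeExtractor {

    // Hypothetical helper: parses well-formed XML and returns the string
    // value of every node selected by the given XPath expression.
    static List<String> evaluate(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // For an attribute node, getNodeValue() is the attribute's value.
            values.add(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder fragment, wrapped in a <div> so it has a single root element.
        String fragment = "<div><a href=\"http://example.com/post/\">"
                + "<img src=\"http://example.com/image.jpg\" alt=\"demo\" /></a>Some text</div>";
        System.out.println("href content: " + evaluate(fragment, "//a/@href"));
        System.out.println("src content: " + evaluate(fragment, "//img/@src"));
    }
}
```

The same evaluate call should work on a DOM produced by an HTML-tolerant parser, provided that parser hands back an org.w3c.dom.Document; element-name case and namespaces may then need attention.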

score -1 · Accepted Answer · answered Apr 11 '13 at 15:24

Extracting data from HTML using regular expressions is not generally recommended, but the following is an example of one basic approach:

String str = "<a href=\"http://rdmobile.fr/blog/mobile-la-pub-consomme-plus-que-les-applications-elles-memes/\"><img align=\"left\" hspace=\"5\" width=\"150\" height=\"150\" src=\"http://rdmobile.fr/blog/wp-content/uploads/2012/03/angry-birds-150x150.jpg\" class=\"alignleft tfe wp-post-image\" alt=\"angry-birds\" title=\"angry-birds\" /></a>Si vous aussi vous vous étonnez de voir votre batterie fondre comme neige au soleil dès lors que jouez à Angry Birds, rassurez-vous, c’est normal. Des chercheurs de l&#8217;université de Purdue se sont intéressés aux publicités destinées majoritairement aux applications gratuites, et oui, comment les développeurs mangent-ils autrement ? Plus sérieusement, cette étude, publiée sur le [...]";        
Matcher m = Pattern.compile(" (?:href|src)=\"([^\"]+)").matcher(str);

while (m.find()) {
    System.out.println(m.group(1));
}

The pattern matches a space followed by href=" or src=", and group 1 then captures the run of one or more characters up to, but not including, the next ".

It therefore will not match if the attribute value is in single quotes or unquoted, if there is whitespace around the =, or if the attribute name is preceded by something other than a space (a tab or newline, for example).

Further explanation on request, or see, for example, Regular-Expressions.info.
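Some of the limitations mentioned above can be relaxed with a slightly longer pattern. The following is a sketch only, not a robust HTML parser: it accepts single or double quotes, optional whitespace around the =, and any casing of the attribute name, but still ignores unquoted values and knows nothing about comments or script content. The class name, the extract helper, and the example.com URLs are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TolerantAttributeRegex {

    // (?i)          case-insensitive, so HREF and Src also match
    // (?<![\w-])    not preceded by a word character or '-', which skips data-src
    // \s*=\s*       optional whitespace on either side of '='
    // group 1       value inside double quotes; group 2: value inside single quotes
    private static final Pattern ATTR = Pattern.compile(
            "(?i)(?<![\\w-])(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Hypothetical helper: returns the matched attribute values in order of appearance.
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            values.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<a href = 'http://example.com/post/'>"
                + "<img SRC=\"http://example.com/image.jpg\" data-src=\"ignored\"></a>";
        extract(html).forEach(System.out::println);
    }
}
```

Using two alternatives, one per quote style, keeps a value such as href="it's" intact, which a single character class excluding both quote characters would cut short.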
