1

I have learnt how to make an HTTP GET request to retrieve data from a URL, but I would like to filter the response so that it gives me only a list of the links on the web page.

For example, if the HTML contained the following text:

<link href="http://www.thompsons.co.uk">

then it should print out:

http://www.thompsons.co.uk

Nat Ritmeyer
Laolu Benson

3 Answers

1

I would strongly recommend that you DO NOT use regexes to "parse" HTML. Unless you have control over the formatting of the web pages you are processing, a solution based on regexes is liable to be fragile and buggy.

Instead, use a permissive HTML parser. This question lists a number of alternatives: HTML/XML Parser for Java
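As a rough illustration of the parser-based approach, here is a minimal sketch. It assumes the HTML parser bundled with the JDK (javax.swing.text.html.parser.ParserDelegator), chosen here only because it needs no extra dependency; it is not necessarily one of the parsers the linked question recommends, and a dedicated library may cope better with modern markup. The class name ParserLinkExtractor and the method extractLinks are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects every href attribute the parser reports.
public class ParserLinkExtractor {

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Called for tags with content, such as <a href="...">text</a>.
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);
            }

            // Called for empty tags, such as <link href="...">.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);
            }

            private void collect(MutableAttributeSet attrs) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        };

        try {
            // The third argument (true) tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><head><link href=\"http://www.thompsons.co.uk\"></head>"
                + "<body><a href=\"http://example.com/\">Example</a></body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

In real use you would pass in the body of your HTTP response instead of the hard-coded string.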

Community
Stephen C
0

Read the whole response into a string, then use a regular expression to extract the links. Read more here: http://www.mkyong.com/regular-expressions/how-to-extract-html-links-with-regular-expression/
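A minimal sketch of that idea, using only java.util.regex. The pattern below is my own simplification, not the one from the linked article: it assumes every link is written as a quoted href attribute, so it will miss unquoted values and will also match href text inside comments or scripts. The class name RegexLinkExtractor is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls quoted href values out of an HTML string.
public class RegexLinkExtractor {

    // Group 1 is the opening quote (single or double); group 2 is the URL.
    // The backreference \1 requires the closing quote to match the opening one.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<link href=\"http://www.thompsons.co.uk\">";
        for (String link : extractLinks(html)) {
            System.out.println(link); // prints http://www.thompsons.co.uk
        }
    }
}
```

This is fine for pages whose markup you know in advance; for arbitrary pages the limitations above are likely to bite.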

Endy
  • Ermm ... did someone mention Tony the Pony??? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Stephen C Sep 06 '12 at 12:33
  • Depends on the case. I've used regexp when I've parsed links and/or other content from specific sources. If the case is to parse generic links, then perhaps another approach is better. – Endy Sep 06 '12 at 13:00
0

You can use jsoup:

http://jsoup.org/cookbook/extracting-data/attributes-text-html

Alexis Dufrenoy