How do I filter an HTTP GET response?

Question

I have learnt how to make an HTTP GET request to retrieve data from a URL, but I would like to filter the response so that it gives me only a list of the links on the web page.

For example, if the HTML contained the following text:

<link href="http://www.thompsons.co.uk">

then it should print out:

http://www.thompsons.co.uk

score 1 · Answer 1 · edited May 23 '17 at 11:59


I would strongly recommend that you DO NOT use regexes to "parse" HTML. Unless you have control over the formatting of the web pages you are processing, a solution based on regexes is liable to be fragile and buggy.

Instead, use a permissive HTML parser. The question "HTML/XML Parser for Java" lists a number of alternatives.
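As a rough illustration of the parser-based approach, here is a minimal sketch that uses the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`) rather than any of the libraries from the linked question. The class name `ParserLinkExtractor` and the method `extractLinks` are made up for this example, and it assumes that collecting every `href` attribute is what you want.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the href attribute of every tag that has one.
class ParserLinkExtractor {

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Called for tags with content, such as <a href="...">text</a>.
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);
            }

            // Called for empty tags, such as <link href="...">.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);
            }

            private void collect(MutableAttributeSet attrs) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><head><link href=\"http://www.thompsons.co.uk\"></head>"
                + "<body><a href=\"http://example.com/a\">A</a></body></html>";
        extractLinks(html).forEach(System.out::println);
    }
}
```

In practice you would pass the body of your HTTP GET response as the `html` argument. The bundled parser only understands older HTML, so a dedicated library is likely to cope better with modern pages.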


answered Sep 06 '12 at 12:24

Stephen C


score 0 · Answer 2 · answered Sep 06 '12 at 12:18


You read in the whole data fully, then parse it with regexp to extract the links. Read more here: http://www.mkyong.com/regular-expressions/how-to-extract-html-links-with-regular-expression/
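A minimal sketch of that idea, assuming the response body has already been read into a `String`. The pattern here is a simplified one written for this example, not the one from the linked article, and the names `RegexLinkExtractor` and `extractLinks` are made up. It only handles quoted `href` values and will also match `href=` text inside comments or scripts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls quoted href values out of raw HTML text.
class RegexLinkExtractor {

    // Group 1 is the opening quote (single or double); group 2 is the URL.
    // The backreference \1 requires the same quote character to close the value.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<link href=\"http://www.thompsons.co.uk\">";
        // Prints: http://www.thompsons.co.uk
        extractLinks(html).forEach(System.out::println);
    }
}
```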

answered Sep 06 '12 at 12:18

Endy


Ermm ... did someone mention Tony the Pony??? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Stephen C Sep 06 '12 at 12:33
Depends on the case. I've used regexp when I've parsed links and/or other content from specific sources. If the case is to parse generic links, then perhaps another approach is better. – Endy Sep 06 '12 at 13:00

score 0 · Accepted Answer · answered Sep 06 '12 at 12:55


You can use jsoup:

http://jsoup.org/cookbook/extracting-data/attributes-text-html
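For the example in the question, a jsoup version might look roughly like the sketch below. It assumes the jsoup jar is on the classpath and that the HTML has already been fetched into a `String`. The class name `JsoupLinkExtractor` and the method `extractLinks` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the href value of every element that has one.
class JsoupLinkExtractor {

    static List<String> extractLinks(String html) {
        // jsoup tolerates malformed markup and builds a full document tree.
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        // The CSS selector "[href]" matches any element with an href attribute,
        // which covers both <link href="..."> and <a href="...">.
        for (Element element : doc.select("[href]")) {
            links.add(element.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<link href=\"http://www.thompsons.co.uk\">";
        extractLinks(html).forEach(System.out::println);
    }
}
```

If you would rather let jsoup do the HTTP GET as well, `Jsoup.connect(url).get()` returns a `Document` directly, and `element.absUrl("href")` resolves relative links against the page URL.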

answered Sep 06 '12 at 12:55

Alexis Dufrenoy

