Get a list of all the urls in a web page

Question

What's the best way to get an array of all the URLs in a web page, and how would I do it?

What kind of url usages are you thinking of? `href` on links, `action` on forms, `src` on images and others, url literals present anywhere on the page, links to css/js, etc — CyberDude, Sep 04 '10 at 10:56
I support CyberDude's proposal. Please specify which kinds of URLs you want exactly. — Julius F, Sep 04 '10 at 11:02

score 2 · Accepted Answer · answered Sep 04 '10 at 10:57

Using HTML Agility Pack is a good way. It may not be the best, since that is subjective, but I can tell you the worst: using regular expressions to parse HTML (since you've tagged your question with regex, I feel obliged to point this out).
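
As a rough illustration of the parser-based approach, here is a minimal sketch that collects the `href` value of every anchor element. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (`LinkExtractor`, `ExtractHrefs`) are hypothetical, and only `href` on `<a>` is covered, not `src`, `action`, or URLs in plain text.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party package, assumed to be installed

public static class LinkExtractor
{
    // Hypothetical helper: returns the href of every <a> element, in document order.
    public static string[] ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new string[0];
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => href.Length > 0)
            .ToArray();
    }
}
```

To cover other kinds of URLs, the XPath expression could be widened (for example `//img[@src]` or `//form[@action]`). Relative links come back exactly as written in the page, so they would still need to be resolved against the page address, for instance with `new Uri(baseUri, href)`.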

Darin Dimitrov

Why has this been downvoted? When downvoting an answer, please leave a comment explaining why you think the answer is wrong. – Darin Dimitrov Sep 04 '10 at 11:57

score 1 · Answer 2 · edited May 23 '17 at 12:01

/<a href=\"([^\"]*)\">(.*)<\/a>/iU

or use this previous answer: "Regular expression for parsing links from a webpage?"

answered Sep 04 '10 at 10:59

I can't believe there are answers suggesting using regex to do screen scraping. – Darin Dimitrov Sep 04 '10 at 11:05

Darin, I'm not suggesting anything; I'm answering the question. The guy is an adult, not a kid. He has your answer to think about. – Sep 04 '10 at 11:10
@Pierre 303, SO is not for just answering questions without thinking. It's primarily for advocating good practices for solving programming-related problems. SO is a pretty well referenced site, many people read it as a source of information on good practices, and suggesting regex to parse HTML in C# is simply [not a good practice](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). I am sorry. – Darin Dimitrov Sep 04 '10 at 11:59
Darin, except to prove your technical superiority, I really don't understand your obstinacy. Maybe being downvoted by someone other than me hurt your feelings? Please put yourself in the shoes of the guy/girl who asked the question. He/she is not asking for advice in his/her question; he/she is asking for an answer. That's what I provide. If he/she wants advice, you have provided it in your answer. So SO is a great place to have DIFFERENT answers, and not only ONE from the technically superior guys. – Sep 04 '10 at 12:14
Oh, and he/she doesn't specifically want to parse HTML; he/she wants to retrieve ALL the URLs contained in a text document that, in his/her case, is an HTML document. Parsing HTML looks like overhead/overengineering to me. Will you pay for that instead of having a simple solution that just works? – Sep 04 '10 at 12:20
Thanks, Pierre 303. How would I retrieve the results and convert them into an array in C#? – Rana Sep 04 '10 at 18:34
There is a full example at http://dotnetperls.com/scraping-html and a simpler one at http://oreilly.com/windows/archive/csharp-regular-expressions.html – Sep 04 '10 at 18:37
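
For the C# question above, here is a minimal sketch of how the pattern from this answer might be run and its captures collected into an array. The `/iU` flags are PCRE-style and have no direct .NET equivalent, so this version assumes `RegexOptions.IgnoreCase` in place of `/i` and a lazy `.*?` in place of the ungreedy `/U`. The names `RegexLinkExtractor` and `ExtractHrefs` are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexLinkExtractor
{
    // .NET translation of the answer's pattern: group 1 is the href value,
    // group 2 is the link text. IgnoreCase stands in for /i, .*? for /U.
    private static readonly Regex AnchorPattern = new Regex(
        "<a href=\"([^\"]*)\">(.*?)</a>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the captured href of every match as an array.
    public static string[] ExtractHrefs(string html)
    {
        return AnchorPattern.Matches(html)
            .Cast<Match>()                    // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }
}
```

Calling `RegexLinkExtractor.ExtractHrefs(html)` gives a `string[]`. Note that this pattern only matches anchors written exactly as `<a href="...">`: links with single quotes, extra attributes before or after `href`, or link text spanning several lines are silently skipped, which is the kind of fragility the other comments in this thread warn about.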