1

The goal of my program is to grab a webpage and then generate a list of absolute links to the pages it links to.

The problem I am having is that when a page redirects to another page without the program knowing, all the relative links come out wrong.

For example:

I give my program this link: moodle.pgmb.si/moodle/course/view.php?id=1

On this page, if it finds the link href="signup.php" meaning signup.php in the current directory, it errors because there is no directory above the root.

However this error is invalid because the page's real location is:
moodle.pgmb.si/moodle/login/index.php

Meaning that "signup.php" is linking to moodle.pgmb.si/signup.php which is a valid page, not moodle.pgmb.si/moodle/course/signup.php like my program thinks.

So my question is how is my program supposed to know that the page it received is at another location?

I am doing this in C# using the following code to get the HTML:

string html;
WebRequest wrq = WebRequest.Create(address);
using (WebResponse wrs = wrq.GetResponse())
using (StreamReader strdr = new StreamReader(wrs.GetResponseStream()))
{
    // Read the whole response body; the reader and response are disposed on exit
    html = strdr.ReadToEnd();
}
IAbstract
CoderWalker
  • Hi, if you found my answer helpful, it would've great if you could mark it as the accepted answer. Thanks! If not, let me know how I can improve it and I'd be glad to do so. – msigman Mar 24 '12 at 16:11

3 Answers

2

What I would do is first check whether each link is absolute or relative by seeing if it starts with a scheme such as "http://". If it's absolute, you're done. If it's relative, then you need to prepend the path of the page you're scanning.

There are a number of ways you could get the current path: you could Split() it on the slashes ("/"), then recombine all but the last one. Or you could search for the last occurrence of a slash and then take a substring of up to and including that position.
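As an alternative to manual string splitting, the framework's Uri class can do the combining. The sketch below is not this answer's own method; the class name LinkResolver and the "http://" scheme on the example URLs are assumptions made for illustration.

```csharp
using System;

static class LinkResolver
{
    // Resolves an href against the URL of the page it was found on.
    // pageUrl must be absolute (scheme included); href may be relative or absolute.
    // Returns null when the two cannot be combined into a valid URI.
    public static Uri Resolve(string pageUrl, string href)
    {
        Uri pageUri = new Uri(pageUrl);
        Uri result;
        // An absolute href is returned unchanged; a relative one is
        // resolved against the directory of pageUri.
        return Uri.TryCreate(pageUri, href, out result) ? result : null;
    }
}
```

For instance, LinkResolver.Resolve("http://moodle.pgmb.si/moodle/login/index.php", "signup.php") gives http://moodle.pgmb.si/moodle/login/signup.php, so the result is only as good as the page URL passed in.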

Edit: Re-reading the question, I'm not sure I understand. href="signup.php" is a relative link, which should resolve to signup.php in the directory of the current page. So the behavior you described, "moodle.pgmb.si/moodle/course/signup.php", is correct.

msigman
2

The problem is that, if the URL isn't a relative or absolute URL, then you have no way of knowing where it goes unless you request it. Even then, it might not actually be served from where you think it is located, because it might be implemented as an HTTP redirect or something similar on the server side.

So if you want to be exhaustive, what you can do is:

  1. Use your current technique to grab a list of all links on the page.
  2. Attempt to request each of those pages. Then if you:
    1. Get a 200 response code, then all is good: it's there.
    2. Get a 404 response code, then you know the page does not exist.
    3. Get a 3XX response code, then you know where the web server expects that content to actually originate from.

Your HttpWebResponse object has a StatusCode property. Note that you should also handle any possible WebException errors: these too carry a WebResponse (in the Response property) with a StatusCode (usually 4xx or 5xx).

You can also look at the HttpWebResponse Headers property - the Location header.
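The steps above could be sketched roughly as follows. One detail matters here: HttpWebRequest follows redirects automatically by default, so the 3XX code and the Location header are only visible with AllowAutoRedirect set to false. The class RedirectProbe, its method names, and the string labels it returns are hypothetical and only for illustration.

```csharp
using System;
using System.Net;

static class RedirectProbe
{
    // Rough interpretation of a status code for link checking.
    public static string Classify(HttpStatusCode status)
    {
        int code = (int)status;
        if (code >= 200 && code < 300) return "ok";
        if (code >= 300 && code < 400) return "redirect";
        if (code == 404) return "missing";
        return "error";
    }

    // A Location header may be relative, so resolve it against the requested URI.
    public static Uri RedirectTarget(Uri requested, string locationHeader)
    {
        return new Uri(requested, locationHeader);
    }

    // Requests the URI without following redirects and reports what happened.
    // redirectTarget is set only when the server sent a Location header.
    public static string Probe(Uri uri, out Uri redirectTarget)
    {
        redirectTarget = null;
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.AllowAutoRedirect = false; // otherwise 3XX responses are hidden
        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                string location = response.Headers["Location"];
                if (location != null) redirectTarget = RedirectTarget(uri, location);
                return Classify(response.StatusCode);
            }
        }
        catch (WebException ex)
        {
            // 4xx and 5xx responses arrive as exceptions that carry the response.
            HttpWebResponse failed = ex.Response as HttpWebResponse;
            return failed != null ? Classify(failed.StatusCode) : "error";
        }
    }
}
```

A crawler could call Probe on each collected link and, when it reports "redirect", use redirectTarget as the base for that page's relative links.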

dash
2

You should be able to use the ResponseUri property of the WebResponse class. This will contain the URI of the internet resource that actually provided the response data, as opposed to the resource that was requested. You can then use this URI to build correct links.

http://msdn.microsoft.com/en-us/library/system.net.webresponse.responseuri.aspx
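A minimal sketch of how this might fit into the question's fetch code, assuming redirects are followed automatically (the default). The class PageFetcher and its method names are made up for illustration, and extracting the href values from the HTML is left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

static class PageFetcher
{
    // Downloads the page and reports the URI that actually served it,
    // which differs from address when the server redirected the request.
    public static string Fetch(string address, out Uri finalUri)
    {
        WebRequest request = WebRequest.Create(address);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            finalUri = response.ResponseUri;
            return reader.ReadToEnd();
        }
    }

    // Turns the hrefs found on the page into absolute URIs,
    // using the final (post-redirect) URI as the base.
    public static List<Uri> ToAbsolute(Uri finalUri, IEnumerable<string> hrefs)
    {
        var links = new List<Uri>();
        foreach (string href in hrefs)
        {
            Uri absolute;
            // Skip hrefs that cannot be combined into a valid URI.
            if (Uri.TryCreate(finalUri, href, out absolute)) links.Add(absolute);
        }
        return links;
    }
}
```

The key point is that the base passed to ToAbsolute is the finalUri returned by Fetch, not the address originally requested.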

John Saunders
Porco