
I'm working on a PHP script that tries to resolve a vague URL (for example, typing in facebook.com) to an absolute URL (such as https://www.facebook.com), similar to what your browser does every day.

So far I've got the following code:

$link = gethostbyname("facebook.com");

This returns an IPv4 address, which works, but then when I do a reverse lookup using:

$link2 = gethostbyaddr($link);

I'm expecting to receive a valid URL like "https://www.facebook.com", but instead I get garbage such as "'edge-star-mini-shv-13-atn1.facebook.com'".

This then breaks any hope of using fopen or curl to try and read the contents of the webpage.

Can anyone explain what's gone wrong here and how I can resolve it?

EDIT: Attempting an insecure URL like "google.co.uk" returns "'lhr25s10-in-f3.1e100.net'", so it's not something to do with secure HTTP (HTTPS)

Raisus
  • @C0dekid - Yes. "google.co.uk" provides "'lhr25s10-in-f3.1e100.net'" and that's not secure at all – Raisus Jun 15 '16 at 12:49
  • Hold on.. host is the domain this IP is hosted, use a GeoIP tool and see it yourself. Hostbyaddr is not the website link. – node_modules Jun 15 '16 at 12:50
  • according to PHP.net (http://php.net/manual/en/function.gethostbyaddr.php) it is – Raisus Jun 15 '16 at 12:51
  • Try `tracert facebook.com` (or `traceroute facebook.com` on Linux) and you'll see a lot of garbage. These are the hostnames. – Holger Jun 15 '16 at 12:52
  • a host name is not a URL. – Karoly Horvath Jun 15 '16 at 12:52
  • oh. Well, that would make sense... – Raisus Jun 15 '16 at 12:52
  • I don't see why this has so many down votes, it is an interesting topic that the servers that host a website are *not* the servers that distinguish the name of the website. – Martin Jun 15 '16 at 12:56
  • @Martin - Thank you. I am grateful that someone appreciates the question and it's not made clear in PHP how your browsers do this, so I thought it was a valid question – Raisus Jun 15 '16 at 12:57
  • To be honest Raisus, I think you need to read up quite a bit more on the details of how DNS servers work and how the identification of websites is actually structured, the browser only presents a very limited and cleaned tidy version of the collection of addresses, IP's, names and servers out on the World Wide Web – Martin Jun 15 '16 at 13:02

1 Answer


gethostbyaddr gets a hostname, not a URL, for an IP address.

Multiple hostnames can be assigned to a single IP address.

gethostbyaddr returns only one of them, typically the name in the address's reverse DNS (PTR) record.

An HTTP server listening on that IP address will handle requests to all the hostnames.

An HTTP request includes a request header called Host which specifies which hostname you are asking for.

The HTTP server can pay attention to that header and serve up different content for different hostnames. This allows multiple websites to be hosted on a single IP address. This is very useful since IPv4 addresses are in limited supply and there are many, many websites.
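As a rough illustration of the Host header, here is a minimal sketch of the request text a client sends. The helper name `buildHttpRequest` is hypothetical and exists only for this example. The two requests could be sent to the same IP address; they differ only in the Host line, which is what lets the server choose which site to serve.

```php
<?php
// Hypothetical helper (not a PHP built-in): build the text of a minimal
// HTTP/1.1 GET request for the given hostname and path.
function buildHttpRequest(string $hostname, string $path = '/'): string
{
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$hostname}\r\n"      // tells the server which site is wanted
         . "Connection: close\r\n"
         . "\r\n";                      // a blank line ends the headers
}

// Same server, same IP address, two different Host headers.
$wanted   = buildHttpRequest('www.facebook.com');
$reversed = buildHttpRequest('edge-star-mini-shv-13-atn1.facebook.com');
```

Whether the server answers the second request with the site you expect depends entirely on how it is configured.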

You are getting the default hostname for the computer hosting facebook.com, but the webserver isn't hosting the website you want on that hostname.

Quentin
  • Ah. I see. What do you recommend to use instead to retrieve the website? If you can edit your answer to include this, I'll accept it – Raisus Jun 15 '16 at 12:53
  • To start out knowing the URL of the website you want to deal with (instead of the IP address of the computer hosting it) and just use that. i.e. to skip absolutely everything you are doing in the question. – Quentin Jun 15 '16 at 12:54
  • BUT... that's not how your browser does it. You don't HAVE to know the exact URL in order for your browser to interpret your intentions – Raisus Jun 15 '16 at 12:56
  • @Raisus — The only thing the browser does differently is that it will add `http://` to the front of the URL if you forget to type it (or it will shove everything over to a search engine if it looks like you aren't typing a URL at all) – Quentin Jun 15 '16 at 12:57
  • I can type into my browser "google.com" and it'll not only add http (or in facebook.com's case https) but also add the www. as well. EDIT: you don't say that all it does is add the www. as well, what about sites that don't use www, like a blog. or other? – Raisus Jun 15 '16 at 12:58
  • @Raisus — No, it won't. It will make a request to `http://google.com/` and then Google's servers will respond with an HTTP redirect response which the browser will follow. – Quentin Jun 15 '16 at 13:00
  • @Raisus that's the website itself doing that. The browser sees the dot (`.`), interprets the input as a URL and visits `http://google.com`; the server at google.com then uses something like `.htaccess` to redirect the address to `http://www.google.com` – Martin Jun 15 '16 at 13:00
  • @Quentin: and how do you tell PHP to do the same thing? as opening "`http://google.com`" won't work – Raisus Jun 15 '16 at 13:01
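One way to mimic the behaviour described in the comments above (prepend `http://` when no scheme was typed, then follow whatever redirects the server sends) is sketched below. It assumes PHP 7.1 or later with the cURL extension enabled. The helper names `addDefaultScheme` and `resolveFinalUrl` are hypothetical and chosen for this example; the `curl_*` functions and `CURLOPT_*`/`CURLINFO_*` constants are the standard PHP cURL API.

```php
<?php
// Hypothetical helper: prepend a scheme when the input has none,
// roughly what a browser's address bar does.
function addDefaultScheme(string $input, string $scheme = 'http'): string
{
    $input = trim($input);
    // Inputs that already start with "scheme://" are left untouched.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $input)) {
        return $input;
    }
    return $scheme . '://' . $input;
}

// Hypothetical helper: request the URL, follow HTTP redirects, and
// return the URL of the last response, or null if the request failed.
function resolveFinalUrl(string $input): ?string
{
    $ch = curl_init(addDefaultScheme($input));
    curl_setopt_array($ch, [
        CURLOPT_FOLLOWLOCATION => true, // follow Location: redirects
        CURLOPT_MAXREDIRS      => 10,   // give up on redirect loops
        CURLOPT_RETURNTRANSFER => true, // return the response instead of printing it
        CURLOPT_NOBODY         => true, // HEAD request: only the final URL is needed
        CURLOPT_TIMEOUT        => 10,   // seconds
    ]);
    $result = curl_exec($ch);
    if ($result === false) {
        return null;
    }
    return curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}
```

Calling `resolveFinalUrl('facebook.com')` then returns whatever URL the site's own redirects lead to, which depends on the server and can change over time. Some servers respond differently to HEAD requests, so dropping `CURLOPT_NOBODY` and fetching the body is a reasonable fallback.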