1

I'm modifying a simple PHP crawler script.

One of the modules it uses converts relative URLs into absolute URLs.

For this, I need a way to determine the base href of a given URL. Otherwise I end up with a bunch of wrongly converted links.

I need a simple function to check if a URL has a base href tag, and if yes, return it.

Thanks

Uno Mein Ame

3 Answers

0

parse_url() splits up a URL into its parts. You can get what you need from that.

Griffin
  • Actually it doesn't. pathinfo is for splitting up a filepath – hoppa Apr 03 '12 at 10:18
  • D'oh, I meant parse_url! Fixed. – Griffin Apr 03 '12 at 10:19
  • @Griffin he's talking about a web crawler, so I imagine he reads, for example, a page at `http://host/products/foobar.html`, and in there are URLs (to other pages or images) like `images/foobar.jpg`. When combined, this becomes the URL `http://host/products/images/foobar.jpg`, which is not right, since the image is at `http://host/images/foobar.jpg`. So somewhere on the `foobar.html` page, there is a `<base href>` tag. He wants to read that in order to resolve all relative URLs. – CodeCaster Apr 03 '12 at 11:13
0

I don't know exactly what you mean, but `parse_url` will give you a lot of information, such as the hostname, the query string, etc.

If I understand you correctly, you want to know whether there is an http in your URL. The scheme part of the information `parse_url` returns is your friend here. If the scheme is empty or something other than http, you know that there was no http in your URL.
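A minimal sketch of that scheme check, in case it helps. The function name `has_http_scheme` is made up for illustration, and accepting `https` alongside `http` is my assumption rather than something stated above.

```php
<?php
// Hypothetical helper: true when $url carries an explicit http or https scheme.
function has_http_scheme(string $url): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);

    // parse_url() returns null when the component is absent and false on
    // seriously malformed input; either way there is no usable scheme.
    if (!is_string($scheme)) {
        return false;
    }

    // Assumption: https counts as "http" for a crawler's purposes.
    return in_array(strtolower($scheme), ['http', 'https'], true);
}
```

A link such as `images/foobar.jpg` has no scheme at all, so it would be treated as relative and passed on to the converter.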

Inside the crawler, you start crawling a specific page and parse that HTML, if I understand your question correctly. Simply construct the base URL (without paths) from the information `parse_url` gives you, and I don't see any problems.
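Roughly, building the base URL without paths could look like the sketch below. `site_root` is a hypothetical name, and keeping the port is my assumption. Note that this yields the site root, which is the right base for root-relative links like `/images/foobar.jpg` but not necessarily for links relative to the page's own directory.

```php
<?php
// Hypothetical helper: rebuilds "scheme://host[:port]" from a page URL,
// or returns null when the URL has no scheme or host to build from.
function site_root(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }

    $root = $parts['scheme'] . '://' . $parts['host'];

    // Assumption: a non-default port is part of the base and should be kept.
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    return $root;
}
```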

hoppa
0

I need a simple function to check if a URL has a base href tag, and if yes, return it.

A URL cannot have a base href tag, since that is an HTML tag. It might be defined in the HTML that you retrieve from that URL. How to read that can be found at this question.
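As a rough sketch of reading it, assuming the crawler already has the page's HTML in a string and that PHP's DOM extension is available: the function name `find_base_href` is hypothetical, and this is one possible approach, not necessarily what the linked question suggests.

```php
<?php
// Hypothetical helper: returns the href of the first <base> tag that has a
// non-empty href in $html, or null when the page does not declare one.
function find_base_href(string $html): ?string
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect libxml warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    foreach ($doc->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            return $href;
        }
    }

    return null;
}
```

When this returns null, the crawler would fall back to the URL of the page itself as the base for relative links.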

Community
CodeCaster