Using Simple HTML DOM to get absolute URLs

Question

What I want to do: Scrape all the links from a page using Simple HTML DOM while taking care to get full links (i.e. from http:// all the way to the end of the address).

My Problem: I get links like /wiki/Cell_wall instead of http://www.wikipedia.com/wiki/Cell_wall.

More examples: If I scrape the URL: http://en.wikipedia.org/wiki/Leaf, I get links like /wiki/Cataphyll, and //en.wikipedia.org/. Or if I'm scraping http://php.net/manual/en/function.strpos.php, I get links like function.strripos.php.

I've tried so many different techniques of building the actual full URL, but there are so many possible cases that I am completely at a loss as to how I can possibly cover all the bases.

However, I'm sure there are many people who've had this problem before - which is why I turn to you!

P.S. I suppose this question could almost be reduced to just handling local hrefs, but as mentioned above, I've come across //en.wikipedia.org/, which is not a full URL and yet is not local.

Use a regular expression for this; see the regex here: http://stackoverflow.com/questions/833469/regular-expression-for-url — Rajiv Pingale, Dec 03 '12 at 07:20
Do you need to scrape many pages? Because instead of using a scraper, you could also use headless javascript (http://phantomjs.org/) so you can get the url by using javascript. This however means it will be much slower than just scraping it. — sroes, Dec 03 '12 at 07:23
I think your question could be shortened to: how to combine an absolute URL and a relative URL in PHP. — Alexei Levenkov, Dec 03 '12 at 07:23
@Rajiv Pingale - Okay, I can see this will help, but I am not just trying to test whether a link is full or not - I'm trying to get the full url - whether that means I have to construct it, or somehow scrape it. — , Dec 03 '12 at 07:27
@sander Roes: Yes, I've got to scrape many pages, but thanks — , Dec 03 '12 at 07:27
@AlexeiLevenkov I think you're right - read my P.S though. I will change the title (if I can?) — , Dec 03 '12 at 07:28
Why can't you just check if the URL starts with `http://`, and if it doesn't, just concatenate the URL you're scraping with it. (If the URL starts with a /, you'd just concatenate it with the base URL.) — HellaMad, Dec 03 '12 at 07:35
@DC_ I've tried this, but stumbled across some problems, as above: Scraping `http://en.wikipedia.org/wiki/Leaf` (the base URL), I get links like `/wiki/Cataphyll`. Concatenating: `http://en.wikipedia.org/wiki/Leaf/wiki/Cataphyll` ... which doesn't make much sense. I would have thought that `/wiki/Cataphyll` should link to `http://en.wikipedia.org/wiki/wiki/Cataphyll` because it has a '/' at the beginning, but it actually links to `http://en.wikipedia.org/wiki/Cataphyll` — , Dec 03 '12 at 07:40
@JoeRocc When I say *base URL*, I am referring to `http://en.wikipedia.org/`. I suppose it might be considered the web root. — HellaMad, Dec 03 '12 at 07:46
I gave the regular expression as a reference; you can use it in your scraper. But I would need to see the content of the page you are scraping. Hopefully you have the full URL in the href. If not, you have the option of saving the base URL in a variable [http://stackoverflow.com/a/10326353/699695] and then joining the two URLs. I hope this solves your problem. — Rajiv Pingale, Dec 03 '12 at 10:18

score 1 · Answer 1 · answered Dec 06 '12 at 00:48

I think this is what you're looking for. It worked for me on an old project.

http://www.electrictoolbox.com/php-resolve-relative-urls-absolute/

Paul Dessert

I found I had to [`urldecode`](https://secure.php.net/manual/en/function.urldecode.php) the URL the library outputs; otherwise it would concatenate the pagination query strings from the relative URLs on every loop iteration (even when [`unset`](https://secure.php.net/manual/en/function.unset.php)ing). – Leo Oct 19 '16 at 11:09

score 1 · Answer 2 · edited Oct 19 '16 at 00:17

You need a library that converts relative urls to absolute. URL To Absolute seems popular. Then you just:

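    require 'simple_html_dom.php';   // Simple HTML DOM parser
    require 'url_to_absolute.php';   // URL To Absolute library

    $base = 'http://en.wikipedia.org/wiki/Leaf';  // the page being scraped
    $doc  = file_get_html($base);                 // load and parse the page

    // Resolve every link against the URL of the page it was found on.
    foreach ($doc->find('a[href]') as $a) {
        echo url_to_absolute($base, $a->href) . "\n";
    }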

See PHP: How to resolve a relative url for a list of libraries.
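
To make clearer what such a library has to handle, here is a minimal sketch of a resolver covering the cases raised in the question: already-absolute links, protocol-relative links (`//host/...`), root-relative links (`/wiki/...`) and document-relative links (`function.strripos.php`). The function name `resolve_url` is hypothetical, and the dot-segment handling is a simplification of RFC 3986 rather than a full implementation, so treat it as an illustration and not as a replacement for a tested library.

```php
<?php
// Hypothetical helper for illustration; not part of Simple HTML DOM
// or of the URL To Absolute library.
function resolve_url(string $base, string $rel): string
{
    // Already absolute: starts with a scheme such as "http:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;
    }

    $b      = parse_url($base);
    $scheme = $b['scheme'] ?? 'http';
    $root   = $scheme . '://' . ($b['host'] ?? '')
            . (isset($b['port']) ? ':' . $b['port'] : '');
    $path   = $b['path'] ?? '/';

    // Empty or fragment-only: same document, new fragment.
    if ($rel === '' || $rel[0] === '#') {
        return explode('#', $base)[0] . $rel;
    }
    // Protocol-relative ("//host/..."): borrow the scheme of the base.
    if (strncmp($rel, '//', 2) === 0) {
        return $scheme . ':' . $rel;
    }
    // Query-only: keep the base path, replace the query.
    if ($rel[0] === '?') {
        return $root . $path . $rel;
    }
    // Root-relative links are used as-is; document-relative links are
    // appended to the directory of the base path (up to the last "/").
    if ($rel[0] !== '/') {
        $rel = preg_replace('#[^/]*$#', '', $path) . $rel;
    }

    // Split off any query/fragment so dots inside them are left alone.
    $cut    = strcspn($rel, '?#');
    $suffix = (string) substr($rel, $cut);
    $out    = [];
    foreach (explode('/', substr($rel, 0, $cut)) as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);   // step up, but never above the root
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return $root . implode('/', $out) . $suffix;
}

// The three examples from the question:
echo resolve_url('http://en.wikipedia.org/wiki/Leaf', '/wiki/Cataphyll'), "\n";
// http://en.wikipedia.org/wiki/Cataphyll
echo resolve_url('http://en.wikipedia.org/wiki/Leaf', '//en.wikipedia.org/'), "\n";
// http://en.wikipedia.org/
echo resolve_url('http://php.net/manual/en/function.strpos.php', 'function.strripos.php'), "\n";
// http://php.net/manual/en/function.strripos.php
```

The sketch ignores credentials in the base URL and any `<base href>` tag in the scraped page, both of which a full library would need to consider.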

edited Oct 19 '16 at 00:17

Ryan Bemrose

answered Dec 06 '12 at 02:50

pguardiario

website is broken – Leo Oct 05 '16 at 16:46
Updated *URL To Absolute*, edit rejected, URL remains broken. – Leo Oct 25 '16 at 11:31

score 0 · Answer 3 · answered Dec 03 '12 at 07:23

I don't know if this is what you are looking for, but this will give you the full URL of the page it is executed from:

    window.location.href

Hope it helps.

Dumle29

score 0 · Answer 4 · answered Dec 03 '12 at 13:48

Okay, thanks everyone for your comments.

I think the solution is to use regex to find the webroot of any particular URL, then simply append the local address to this.

Tricky part: designing a regex that works for all domains, including their subdomains...
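
As an alternative to a hand-written regex, the web root can be sketched with PHP's built-in `parse_url`, which splits a URL into scheme, host, port and path; subdomains need no special handling because they are simply part of the host. The helper name `web_root` is hypothetical. Note that this swaps the regex idea for `parse_url`, and that appending to the web root is only correct for links that start with a single `/`.

```php
<?php
// Hypothetical helper: returns "scheme://host[:port]" for a URL,
// or null when the URL has no scheme or host to work with.
function web_root(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];   // keep a non-default port
    }
    return $root;
}

// Root-relative link: web root + link gives the right address.
echo web_root('http://en.wikipedia.org/wiki/Leaf') . '/wiki/Cataphyll', "\n";
// http://en.wikipedia.org/wiki/Cataphyll
```

A link such as `function.strripos.php` found on http://php.net/manual/en/function.strpos.php is relative to the directory of the current page (`/manual/en/`), not to the web root, and `//en.wikipedia.org/` needs only the scheme prepended. Both cases have to be handled separately from the simple append.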
