I'm trying to scrape some websites using cURL. To make the relative URLs resolve against the original site, I insert a <base> tag like this:

 $curl_scraped_page = preg_replace("/<head>/i", "<head><base href='$url' />", $curl_scraped_page, 1);

It works for most websites, but not all of them. For instance, on this website, "NS Website", it has no effect at all: the relative URLs are resolved against my own domain as the base URL, e.g. mydomain.com/css.css

This is the complete code I'm using:

<?php

$url = $_GET['url'];

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,2);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);

$curl_scraped_page = preg_replace("/<head>/i", "<head><base href='$url' />", $curl_scraped_page, 1);

curl_close($ch);

echo $curl_scraped_page;

?>

Live example at phpfiddle

Youss
  • That's because you are using a regular expression to add your element. The easiest way would be with a DOMDocument. The specific reason it does not work for your linked example site is that it has `<head profile="http://gmpg.org/xfn/11">` instead of just `<head>`. – Jon May 05 '13 at 09:33
  • @Jon What do you mean with "DOMDocument"? Javascript? – Youss May 05 '13 at 09:35
  • @Youss http://php.net/DOMDocument - Additionally, this does not work for websites that have a different ` – hakre May 05 '13 at 09:35
  • But -1 from my side: This question does not show any research effort; it is unclear or not useful. Better to first understand *why* things do not work, rather than just dumping code here and asking why it does not work. I bet you are more clever than this. – hakre May 05 '13 at 09:37
  • @hakre I didn't know why it doesn't work, so I couldn't do research for this problem. You are free to downvote :) (although it doesn't help) – Youss May 05 '13 at 09:42
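
The DOMDocument approach suggested in the comments could look roughly like the sketch below. It is only an illustration, not code from the thread: the helper name `add_base_href` and the sample markup are hypothetical, and it assumes the page parses well enough for `loadHTML` to find a `<head>` element.

```php
<?php
// Hypothetical helper (not from the original post): inserts a <base> element
// as the first child of <head> by editing the parsed DOM instead of using a regex.
function add_base_href(string $html, string $url): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $head = $doc->getElementsByTagName('head')->item(0);
    if ($head === null) {
        // No <head> in the parsed document: return the input unchanged.
        return $html;
    }

    $base = $doc->createElement('base');
    $base->setAttribute('href', $url);
    // Insert before the first existing child so <base> precedes any relative links.
    $head->insertBefore($base, $head->firstChild);

    return $doc->saveHTML();
}

// Assumed sample input: a <head> tag that carries an attribute.
$sample = '<html><head profile="http://gmpg.org/xfn/11"><title>t</title></head><body></body></html>';
echo add_base_href($sample, 'http://example.com/');
```

Because the parser locates the element, attributes on `<head>` do not matter here. Note that `saveHTML()` re-serializes the whole document, so the output may differ from the fetched markup in whitespace and doctype.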

1 Answer

Your problem is in the regular expression.

You are looking for <head>, but the given example's website has a <head profile="http://gmpg.org/xfn/11">.

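Replace your regular expression with one that allows attributes inside the tag. Here [^>]* stops at the first >, \b keeps <header> from matching, and $0 re-inserts the matched tag so its attributes are preserved:

$curl_scraped_page = preg_replace('/<head\b[^>]*>/i', "\$0<base href='$url' />", $curl_scraped_page, 1);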
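
As a quick check of how an attribute-tolerant pattern behaves, the sketch below runs one against a few made-up inputs. The function name `inject_base` and the sample strings are assumptions for illustration. It uses `preg_replace_callback` so that characters such as `$` in the URL are not interpreted as backreferences, and `htmlspecialchars` so the URL cannot break out of the attribute.

```php
<?php
// Hypothetical helper (not from the original post): appends a <base> tag
// directly after the first opening <head ...> tag, keeping its attributes.
function inject_base(string $html, string $url): string
{
    $tag = '<base href="' . htmlspecialchars($url, ENT_QUOTES) . '" />';

    // \b prevents a match on <header>; [^>]* stops at the first ">".
    $result = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($tag) {
            return $m[0] . $tag; // $m[0] is the full matched <head ...> tag
        },
        $html,
        1 // only the first match is rewritten
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Assumed sample inputs covering a plain tag, a tag with attributes, and <header>.
$pages = [
    '<html><head><title>a</title></head></html>',
    '<html><head profile="http://gmpg.org/xfn/11"><title>b</title></head></html>',
    '<html><body><header>no head element</header></body></html>',
];

foreach ($pages as $page) {
    echo inject_base($page, 'http://example.com/'), PHP_EOL;
}
```

The third input is left unchanged, which is the case a looser pattern such as `<head.*>` would get wrong.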
Alain Tiemblo