
Possible Duplicate:
How do I save a web page, programatically?

I'm just starting with curl and I've managed to pull an external website:

function get_data($url) {
  $ch = curl_init();
  $timeout = 5; // connection timeout in seconds
  $userAgent = 'Mozilla/5.0 (compatible; MyFetcher/1.0)'; // example user agent string
  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of echoing it
  curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}
$test = get_data("http://www.selfridges.com");
echo $test;

However, the CSS and images are not included. I also need to retrieve the CSS and images, basically the whole website. Can someone please post a brief way to get me started in understanding how to parse the CSS, images and URLs?

user208709

2 Answers


There are better tools for this than PHP, e.g. wget with the --page-requisites parameter.
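
For example, something along these lines should pull the page together with the images, CSS and scripts it references (--convert-links is an additional flag, offered here as a suggestion; it rewrites the references so the local copy works offline):

  wget --page-requisites --convert-links http://www.selfridges.com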

Note however that automatic scraping is often a violation of the site's TOS.

Pekka
  • Thanks for the info; however, I read that wget isn't able to download dynamic PHP sites. If I used wget, wouldn't I lose out on a lot of the content and functionality? – user208709 Jan 20 '13 at 10:31
  • @user that's not true - on the outside, a web site is a web site, whether it's static or generated dynamically by PHP doesn't matter. What this method doesn't catch is dynamic *Javascript* but that is a whole nother ballgame anyway – Pekka Jan 20 '13 at 10:32
  • So if I understand correctly, using wget to accomplish what I need will preserve all of the links, URLs, images, CSS etc. of a website. This would all be in a folder on my local server which I can then call up simply like this: localhost:8888/downloadedSite/index.html? And from a user's perspective the downloaded site would function just like the live site? – user208709 Jan 20 '13 at 10:36
  • @user for very simple web sites, yes. However, there are a lot of things that can break, especially nowadays with sites loading data (sometimes their entire content) through Ajax. That functionality cannot be easily replicated offline. I'd say give it a try and test the end result thoroughly, but be aware that most web sites depend on live servers these days – Pekka Jan 20 '13 at 10:37
  • So I guess it's still a no-go for me then; I need to have everything working like the live site, e.g. if it's an ecommerce site, things would probably break. If I parse the necessary elements with curl, will I be able to get the same functionality as wget without anything breaking? – user208709 Jan 20 '13 at 10:43
  • @user nope. curl and wget do essentially the same thing: they read the HTML code of the page as it is at the moment the page is delivered. Replicating Ajax functionality (which changes the source code after loading) in an offline site is close to impossible. – Pekka Jan 20 '13 at 10:49
  • Replicating server side dynamic features is even closer to impossible. – Quentin Jan 20 '13 at 11:19

There are quite a few HTML parsers available for PHP; here's a post that discusses them: How do you parse and process HTML/XML in PHP?
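
As a rough illustration of the parsing side, here is a minimal sketch using PHP's built-in DOMDocument (one of the options covered in that post). It assumes $html already holds the markup returned by get_data() and only collects image and stylesheet URLs; resolving relative URLs and downloading each asset is left out:

libxml_use_internal_errors(true);  // real-world HTML is rarely valid; keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);             // $html is assumed to come from get_data($url)

$assets = array();

// Image sources
foreach ($doc->getElementsByTagName('img') as $img) {
  $assets[] = $img->getAttribute('src');
}

// Stylesheet links
foreach ($doc->getElementsByTagName('link') as $link) {
  if (strtolower($link->getAttribute('rel')) === 'stylesheet') {
    $assets[] = $link->getAttribute('href');
  }
}

print_r($assets);  // entries may be relative URLs and still need resolving against the page's base URL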

Peter Wooster