A client of mine has hundreds of simple, single-page static websites. They are landing pages for various marketing campaigns, and they all share an identical layout -- a simple two-column page with a header and footer.
I want to copy the content of a few specific divs on each of these landing pages and use it to populate a database, so I can rebuild the sites on a new backend.
Basically, each page has a "main" div and a "sidebar" div, and I need to copy their HTML exactly as is, except that the image URLs should point to locally hosted copies.
I was able to create an array of all image URLs for a given domain using this:
$url="http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
// save image to local server
}
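One caveat I'm aware of: if the src values are relative, file_get_contents() can't fetch them as-is, so they would need to be resolved against the page URL first. A rough sketch (the toAbsoluteUrl helper is my own name for it, and it only handles root-relative paths):

function toAbsoluteUrl($src, $pageUrl) {
    // leave absolute URLs alone
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    // resolve root-relative paths like /img/logo.png against the site root
    $parts = parse_url($pageUrl);
    return $parts['scheme'] . '://' . $parts['host'] . '/' . ltrim($src, '/');
}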
I was able to capture the content of the main div using this method:
$maindiv = $doc->getElementById('main');
echo $doc->saveHTML($maindiv);
which seemed to work well, but the output did not include the inner HTML for the images. Basically, this div contains a paragraph, followed by an HTML bullet list, followed by an image or two, and perhaps a final paragraph. The code above grabbed the text and the bullet list but not the image markup.
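For what it's worth, once I can reliably get the full markup, my plan for the URL replacement is roughly this (the 'images/' path matches the download sketch above and is just a placeholder):

$maindiv = $doc->getElementById('main');
foreach ($maindiv->getElementsByTagName('img') as $img) {
    // point the copied markup at the locally saved file
    $src = $img->getAttribute('src');
    $img->setAttribute('src', 'images/' . basename($src));
}
$mainHtml = $doc->saveHTML($maindiv); // the div's outer HTML, children included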
Is there a better way to do this? If I can figure out how to iterate over these pages and reliably grab the contents of both divs, it will save a huge amount of manual time.
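For context, the overall loop I have in mind is shaped something like this (the $domains list and the database step are placeholders):

$domains = ['http://example.com', 'http://example.org']; // placeholder list

foreach ($domains as $url) {
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));

    foreach (['main', 'sidebar'] as $id) {
        $div = $doc->getElementById($id);
        if ($div === null) {
            continue; // this page is missing the div
        }
        $html = $doc->saveHTML($div);
        // TODO: rewrite img src attributes, then store $html in the database
    }
}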