
I have a problem with the Simple PHP DOM Parser. I basically have to scrape a catalogue site for the images and their titles.

The site I have to scrape is http://pinesite.com.

I have come up with the following code to do it (this will be called via AJAX):

<?php
include ('simple_html_dom.php');
$function = $_GET['function'];
switch($function) {
  case 'subcat':
    $maincat = $_GET['cat'];
    $url = "http://www.pinesite.com/meubelen/index.php?".$maincat."&lang=de";
    $html = file_get_html($url);
    $data = $html->find('.box_166_content .act_path li a');
    $output = array();
    foreach ($data as $subcat) {
      $title = $subcat->plaintext;
      $href = $subcat->href;
      $link['title'] = $title;
      $link['href'] = substr($href, 10);
      $output[] = $link;
    }
    echo json_encode($output);
    $html->clear();
    unset($html);
    unset($url);
    break;

  case 'images':
    $subcat = $_GET['subcat'];
    $url = "http://www.pinesite.com/meubelen/index.php?".$subcat;
    $html = file_get_html($url);
    $iframe = $html->find('#the_iframe',0);
    $url2 = $iframe->src;
    $html->clear(); 
    unset($html);

    $html2 = file_get_html("http://www.pinesite.com/meubelen/".$url2);
    $titles = $html2->find('p');
    $images = $html2->find('img');
    $output = array();
    $i = 0;
    foreach ($images as $image) {
      $item['title'] = $titles[$i]->plaintext;
      $item['thumb'] = $image->src;
      $item['image'] = str_replace('thumb_','',$image->src);
      $output[] = $item;
      $i++;
    }
    echo json_encode($output);
    break;
}
?>

So that's the "functions" file, the part that doesn't work is the last case.

I don't know what's wrong here, so I tested the last case in a separate file, hard-coding the URL that it gets from the iframe (that part does work):

<?php
include_once "simple_html_dom.php";

$fullurl = "http://www.pinesite.com/meubelen/prog/browse.php?taal=nl&groep=18&subgroep=26";

$html = file_get_html($fullurl);
$titles = $html->find('p');
$images = $html->find('img');
$output = array();
$i = 0;
foreach ($images as $image) {
  $item['title'] = $titles[$i]->plaintext;
  $item['thumb'] = $image->src;
  $item['image'] = str_replace('thumb_','',$image->src);
  $output[] = $item;
  $i++;
}
echo json_encode($output);
?>

Like I said, the first script should return the same as the second (if you add `?function=images&subcat=dichte-kast`), but it doesn't. I'm guessing it's because I use the parser multiple times.

Does anybody have a suggestion for me?

vcsjones
Tobias Timpe
  • Nowhere have you actually checked if the URL retrieval worked. Does `$url2` actually have a valid URL in it? Does `$html2` have some page contents? Your script utterly depends on the server's network connection being stable and the remote site being available, with no margin for ANY error. – Marc B Nov 15 '11 at 15:26
  • I know :), this is just a test of the scraping, I will fix all that before it goes live. – Tobias Timpe Nov 15 '11 at 18:30

2 Answers


The problem lies in the fact that your $url2 variable contains HTML entities, and when you concatenate it onto the root URL the result is not a valid URL. Therefore, the file_get_html() function will not retrieve the URL (and thus the data) you expect, but something different.

A quick solution to your problem is html_entity_decode(), but you might want to read up on debugging, too. It can be as easy as applying var_dump() to every variable you're using and seeing where the output differs from what you expect.
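As a minimal sketch of that fix (the iframe src value below is a hypothetical example of what the parser might return, with the query-string ampersands encoded as `&amp;`):

```php
<?php
// Hypothetical iframe src as the parser might return it, with the
// query-string ampersands HTML-encoded.
$url2 = 'prog/browse.php?taal=nl&amp;groep=18&amp;subgroep=26';

// Decode the entities before building the final URL.
$url2 = html_entity_decode($url2);

echo 'http://www.pinesite.com/meubelen/' . $url2;
// http://www.pinesite.com/meubelen/prog/browse.php?taal=nl&groep=18&subgroep=26
```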

You might want to check on some security issues, too. Writing $subcat = $_GET['subcat'] is in no way safer than using $_GET['subcat'] directly.
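For instance, a quick sketch of whitelisting the `function` parameter; the valid values are just the ones the switch statement in the question handles, and the helper name is made up for illustration:

```php
<?php
// Whitelist check for the 'function' parameter; the valid values
// come from the switch cases in the question.
function isAllowedFunction($name) {
    return in_array($name, array('subcat', 'images'), true);
}

var_dump(isAllowedFunction('subcat'));   // bool(true)
var_dump(isAllowedFunction('images'));   // bool(true)
var_dump(isAllowedFunction('evil'));     // bool(false)
```

In the endpoint you would call this before the switch and reject the request (e.g. with a 400 header and exit) when it returns false.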

vindia

I'm not sure I understand the question completely, but from what I can gather you are trying to grab some images and their associated titles from a given webpage and then save them? If that's the case, here is some food for thought (sorry it could not be more specific).

Use file_get_contents() to grab the HTML contents:

$html = file_get_contents('http://www.someurl.com');

Then use preg_match_all() to pull out all of the image tags and any other data you may need. There is lots of info out there on how to do this; see Matching SRC attribute of IMG tag using preg_match.

 preg_match_all('/<img[^>]+>/i', $html, $matches); # this is a guess
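A runnable sketch of that idea, capturing each `img` src with preg_match_all() (the HTML fragment below is a made-up stand-in for the fetched page):

```php
<?php
// Made-up fragment standing in for the fetched page contents.
$html = '<p>Kast</p><img src="thumb_kast.jpg">'
      . '<p>Tafel</p><img src="thumb_tafel.jpg">';

// Capture the src attribute of every img tag.
preg_match_all('/<img[^>]+src="([^"]+)"/i', $html, $matches);

print_r($matches[1]);
// $matches[1] holds array('thumb_kast.jpg', 'thumb_tafel.jpg')
```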

Once you have a collection of image tags as an array, use curl to save the images:

http://www.edmondscommerce.co.uk/php/php-save-images-using-curl/
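The core of that approach is roughly this (the image URL is a hypothetical example, and no error handling is shown):

```php
<?php
// Hypothetical absolute image URL gathered in the previous step.
$src = 'http://www.example.com/images/kast.jpg';

// Stream the image straight into a local file with curl.
$ch = curl_init($src);
$fp = fopen(basename($src), 'wb');   // saves as kast.jpg
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);
```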

I think the problem you have is stripping the HTML content from the content that you want.

Robbo_UK
  • He's already using a DOM parser to do this. Besides, his problem is not in his parsing method. – vindia Nov 15 '11 at 16:05
  • ahh, I misunderstood the question – Robbo_UK Nov 15 '11 at 16:16
  • Just check out http://pinesite, click on a category and then a sub category on the left. All that I want to do is get the src of the product images plus their titles in a JSON format so that I can use them. – Tobias Timpe Nov 15 '11 at 16:37