I've been trying to write a simple PHP script to pull data from an ISBN database site, and for some reason I've had nothing but issues with file_get_contents. I've since managed to get something working, but I'd still like to know why this version wasn't working.

The code below never populated $page with any content, so the preg_match calls that follow found nothing. Does anyone know what was stopping it?

$links = array ('
    http://www.isbndb.com/book/2009_cfa_exam_level_2_schweser_practice_exams_volume_2','
    http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65','
    http://www.isbndb.com/book/waterworks_a02','
    http://www.isbndb.com/book/winning_the_toughest_customer_the_essential_guide_to_selling','
    http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'

    ); // array of URLs

foreach ($links as $link)
{

    $page = file_get_contents($link);
    #print $page;

                preg_match("@<h1 itemprop='name'>(.*?)</h1>@is",$page,$title);
                preg_match("@<a itemprop='publisher' href='http://isbndb.com/publisher/(.*?)'>(.*?)</a>@is",$page,$publisher);
                preg_match("@<span>ISBN10: <span itemprop='isbn'>(.*?)</span>@is",$page,$isbn10);
                preg_match("@<span>ISBN13: <span itemprop='isbn'>(.*?)</span>@is",$page,$isbn13);
                        echo '<tr>
                        <td>'.$title[1].'</td>
                        <td>'.$publisher[2].'</td>
                        <td>'.$isbn10[1].'</td>
                        <td>'.$isbn13[1].'</td>
                        </tr>'; 
                        #exit();                                    

            }
Cœur
  • There's a newline before each of your URLs, could that be causing the issue? – Sean Sep 12 '14 at 13:48
  • Never parse HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Bogdan Burym Sep 12 '14 at 13:50
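
As a rough illustration of the second comment, the same fields could be read with DOMDocument and DOMXPath instead of regular expressions. This is only a sketch: the sample markup is an assumption modelled on the patterns in the question (it is not a copy of the real page), and first_text() is a hypothetical helper introduced here.

```php
<?php
// Sample markup: an assumption based on the regex patterns in the question.
$html = <<<HTML
<html><body>
<h1 itemprop='name'>Waterworks</h1>
<a itemprop='publisher' href='http://isbndb.com/publisher/example'>Example Press</a>
<span>ISBN10: <span itemprop='isbn'>0000000000</span></span>
<span>ISBN13: <span itemprop='isbn'>0000000000000</span></span>
</body></html>
HTML;

// Hypothetical helper: trimmed text of the first node matching the query, or null.
function first_text(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$title     = first_text($xpath, "//h1[@itemprop='name']");
$publisher = first_text($xpath, "//a[@itemprop='publisher']");

// Both ISBN spans share the same itemprop, so collect them in document order.
$isbns = [];
foreach ($xpath->query("//span[@itemprop='isbn']") as $node) {
    $isbns[] = trim($node->textContent);
}
```

Because first_text() returns null when nothing matches, a page that failed to download shows up as missing values rather than as undefined-index notices.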

1 Answer

My guess is that your URLs are not the direct ones. The proper ones should be without the www. part: if you request any of them and inspect the returned headers, you'll see that you're redirected (HTTP 301) to another URL.
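
One way to check for such a redirect is to look through the header lines that get_headers() returns. The sketch below assumes the default (indexed) format of that function; first_redirect() is a hypothetical helper, and the sample header lines are made up for illustration rather than captured from the site.

```php
<?php
// Hypothetical helper: given header lines in the format get_headers() returns,
// report the first redirect as [status code, target URL], or null if none is found.
function first_redirect(array $headerLines): ?array
{
    $status = null;
    foreach ($headerLines as $line) {
        // A status line such as "HTTP/1.1 301 Moved Permanently" starts a new response.
        if (preg_match('@^HTTP/\S+\s+(\d{3})@', $line, $m)) {
            $status = (int) $m[1];
            continue;
        }
        // A Location header only counts when the current response is a 3xx.
        if ($status !== null && $status >= 300 && $status < 400
            && stripos($line, 'Location:') === 0) {
            return [$status, trim(substr($line, strlen('Location:')))];
        }
    }
    return null;
}

// Illustrative header lines (not real output from the site).
$sample = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: http://isbndb.com/book/waterworks_a02',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
];

$redirect = first_redirect($sample);
// Against a live URL you would pass the result of get_headers($url) instead,
// after checking that it is not false.
```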

The best way to do it, in my opinion, is to use cURL, calling curl_setopt with the options CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS.

Of course, you should trim your URLs beforehand, just to be sure that isn't the problem.

Example here:

$curl = curl_init();
foreach ($links as $link) {

   curl_setopt($curl, CURLOPT_URL, trim($link)); // trim stray whitespace/newlines around the URL
   curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
   curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
   curl_setopt($curl, CURLOPT_MAXREDIRS, 5); // max 5 redirects

   $result = curl_exec($curl);
   if (! $result) {
      continue; // if $result is empty or false - ignore and continue;
   }

   // do what you need to do here
}
curl_close($curl);
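
The trimming mentioned earlier could also be done once, before the loop. A minimal sketch, reusing two of the URLs from the question in the same quoting layout; the $clean variable name is just an assumption for this example:

```php
<?php
// Same layout as in the question: each string starts with a newline and spaces.
$links = array('
    http://www.isbndb.com/book/waterworks_a02','
    http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'
);

// Strip surrounding whitespace from every entry, drop entries that end up empty,
// and reindex the result.
$clean = array_values(array_filter(array_map('trim', $links), 'strlen'));
```

Looping over $clean instead of $links means every request gets a URL with no leading newline or spaces.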
Kleskowy