
I want to scrape the full list of items from a URL and store them in a database for later use. The problem is that I only get the first item. I want to get every item on the page, then go to page 2, then 3, then 4, and so on, and scrape all the links if possible.

For each post I want the link (http:..............html) and the title. Then I want to move on to the next page, repeat this for all pages, and store everything in the database.

Here is the code I used:

$url = 'http://newyork.craigslist.org/search/jjj?addFour=part-time';

$timeout = 10;
$ch = curl_init($url);

curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$data = curl_exec($ch);
curl_close($ch);

function get_matched($pattern, $data)
{
    preg_match($pattern, $data, $match);
    return $match[1];
}

$pattern = "/<p>(.*?)<\/p>/";
$caty = get_matched($pattern, $data);

echo "$caty";

How can I do this?

Jason
samanta

2 Answers

1
  1. Wrong use of preg_*

    preg_match only looks for a single match and then returns. You are looking for preg_match_all, since you want more than one match.

  2. Where is the loop/recursion?

    If you'd like to do this right, you'll need some sort of loop or recursive function that keeps fetching data from the new links it finds, and the data on those pages should be fetched following the same pattern.

    There are many resources online for how to write a simple scraper, among them are:
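
For point 1, here is a minimal sketch of what switching to preg_match_all could look like. The sample markup in `$data` and the helper name `extract_links` are assumptions made up for illustration; the real page source will differ, so the pattern would need adjusting after inspecting it.

```php
<?php
// Sample markup is an assumption for illustration only; inspect the real
// page source and adapt the pattern to whatever it actually contains.
$data = <<<HTML
<p><a href="http://example.org/post/1.html">First job</a></p>
<p><a href="http://example.org/post/2.html">Second job</a></p>
HTML;

/**
 * Hypothetical helper: return every link found inside a <p> element
 * as an array of ['url' => ..., 'title' => ...] entries.
 */
function extract_links(string $html): array
{
    // "s" lets "." match newlines, "i" ignores the case of the tags.
    $pattern = '/<p>\s*<a\s+href="([^"]+)"[^>]*>(.*?)<\/a>/si';

    // PREG_SET_ORDER groups the captures per match:
    // $m[1] is the URL and $m[2] is the link text.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        $links[] = ['url' => $m[1], 'title' => trim(strip_tags($m[2]))];
    }
    return $links;
}

// The result is an array, so it has to be iterated rather than echoed directly.
foreach (extract_links($data) as $link) {
    echo $link['url'], ' | ', $link['title'], "\n";
}
```

Echoing the returned array directly prints only the word "Array", which is why the loop at the end is needed. For anything beyond a quick experiment, an HTML parser such as DOMDocument is usually more robust than a regular expression.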

Filip Roséen - refp
  • If I use preg_match_all like this: `function get_matched($pattern, $data) { preg_match_all($pattern, $data, $match); return $match[1]; }` then echo prints "Array" instead of the items. – samanta Dec 24 '11 at 07:58
  • Thanks for the starting point, I will try your advice and let you know! – samanta Dec 24 '11 at 08:00
  • @samanta You wanted links to the manual; they are in the post. If you want to find more than one item, you will get them back as an array, and you'll need to iterate over it to get the values. Haven't worked with arrays before? http://php.net/manual/en/language.types.array.php – Filip Roséen - refp Dec 24 '11 at 08:00
  • @samanta `foreach ($match[1] as $val) { echo $val; }` should be sufficient. Try it out and then try to understand it using the links provided. – Filip Roséen - refp Dec 24 '11 at 08:06
  • Do you know what regex would match the http links? – samanta Dec 24 '11 at 08:48
  • @samanta You should ask that in a separate question instead of asking everyone in comments. – Filip Roséen - refp Dec 24 '11 at 10:26
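
For point 2 of the answer above, the loop could be sketched roughly as follows. Everything named here is hypothetical: the helpers `fetch_page`, `extract_items` and `scrape_pages`, the link pattern, and above all the `s` offset parameter used for paging. How the site really pages its results has to be checked against its actual "next" links.

```php
<?php
/** Fetch one URL with cURL; returns the body, or null on failure. */
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

/** Pull url/title pairs for links ending in ".html" out of one page. */
function extract_items(string $html): array
{
    // Assumed pattern; adjust it to the real markup.
    $pattern = '/<a\s+href="([^"]+\.html)"[^>]*>(.*?)<\/a>/si';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $items = [];
    foreach ($matches as $m) {
        $items[] = ['url' => $m[1], 'title' => trim(strip_tags($m[2]))];
    }
    return $items;
}

/**
 * Walk the result pages until one fails, comes back empty, or $maxPages
 * is reached. $fetch is any callable mapping a URL to HTML (or null).
 */
function scrape_pages(string $baseUrl, callable $fetch, int $perPage = 100, int $maxPages = 50): array
{
    $all = [];
    for ($page = 0; $page < $maxPages; $page++) {
        // Hypothetical paging scheme: an "s" offset appended to the query string.
        $separator = strpos($baseUrl, '?') === false ? '?' : '&';
        $url = $baseUrl . $separator . 's=' . ($page * $perPage);

        $html = $fetch($url);
        if ($html === null) {
            break; // request failed
        }

        $items = extract_items($html);
        if (!$items) {
            break; // no more results
        }

        foreach ($items as $item) {
            $all[$item['url']] = $item; // keyed by URL to drop duplicates
        }
    }
    return array_values($all);
}

// Usage (performs real HTTP requests, so it is left commented out):
// $items = scrape_pages('http://newyork.craigslist.org/search/jjj?addFour=part-time', 'fetch_page');
```

Passing the fetcher in as a callable keeps the loop testable without network access. Each returned item could then be written to the database, for example with a prepared INSERT statement, and a short pause between requests would be polite to the server.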
0

This is the best link:

http://php.net/manual/en/book.curl.php

xkeshav
  • I would like something clearer. I have been trying to do this in my spare time for a week, without any success. – samanta Dec 24 '11 at 07:42
  • I'm not getting an error, but the result is only 1 item from what I'm trying to scrape. I want the whole page, which is about 100 items per page, and then I want to go to the second page and do the same, and so on! – samanta Dec 24 '11 at 07:52
  • Any advice on how I should echo the result? – samanta Dec 24 '11 at 07:54
  • Do: `echo "\n"; print_r($data);` and see what comes back. – xkeshav Dec 24 '11 at 08:21