
I have a list of over 500 URLs that I have to scrape because my distributor doesn't offer an API or a CSV export. The list is actually an array containing the IDs of the products that I want to keep track of:

$arr = [1,2,3,...,564];

The URL is always the same; only the ID at the end changes:

$url = 'https://shop.com/products.php?id=';

Now, on localhost, I used a foreach loop to scrape each and every one of those URLs:

foreach($arr as $id){
    
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url . $id);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
   
    //preg_match_all - get the data that I'm looking for
    //put that data into an array

    curl_close($ch);

}

But the problem is that, first of all, I don't think that's wise at all - and I know that for a fact, because when I (accidentally) ran the script (on localhost) my access to shop.com was banned/blocked and I got the message: Too many requests... 429 :D

I tried sleeping inside that foreach every 10 iterations, using 10 as the modulus:

$x = 0;
foreach($arr as $id){
   
    //curl request - get the data and add it to an array
    $x++;
    if($x % 10 == 0){
        sleep(2);
    }      

}

But this takes forever to execute.

Even though I am able to connect and get the data that I need from each individual product, I want to find a solution using curl (since there's no API or CSV) that will run the whole job at once, but in a safe/wise way.

Is there something like that? If yes, can you please help me understand how?

Thank you!

emma
  • Have you asked the supplier how many requests you can make before they block you? Can you tune it to just below their limit? – Nigel Ren Aug 21 '18 at 18:45
  • Hey @NigelRen, I didn't ask them, I ran the script accidentally, but that helped me understand that it's too much :D - And yes, I can code it to run only 10 at a time every 1-2 minutes through a cron job, but I don't see this option as being too smart and/or extendable in the future... – emma Aug 21 '18 at 18:48
  • It may be that they allow 100 requests a second or 20. If it's 100 you can do it in a few larger batches, if they only allow 20 requests a second then you may have to put up with the long run. – Nigel Ren Aug 21 '18 at 18:50
  • @NigelRen, but isn't there another way? I was thinking about running 10 requests at a time, but that means at least 8.3(3) requests per minute to keep it within 1h (in my scenario - 500 urls and 1h) - I don't understand why they wouldn't provide me with an API :( I hate them :D – emma Aug 21 '18 at 18:54
  • Without knowing who the supplier is, it's difficult. I set up some designs on a 3D printing place, they provided APIs for all sorts but you couldn't get a list of what you had actually sold, so they can be a pain at times :-/ Worth checking their limit though, see if they will tell you how often you can make requests. – Nigel Ren Aug 21 '18 at 19:01
  • Either ask the supplier to up the limit of requests or find another way to obtain the data I would say. It should not be too difficult to explain to your supplier that it would be much much **much** faster and cleaner to obtain the data through a suitable interface ;) Good Luck! – Evochrome Aug 21 '18 at 19:06
  • Hey @Evochrome, I've sent them an email just now - but if the answer is no for both (raising the limit and building an API), is there any other option? – emma Aug 21 '18 at 19:07
  • There probably is (although again, less optimized), depending on the solution you would like to have. Does the content need to be refreshed every time? Does it (the raw data) need to be non-accessible by the client? What kind of content do you crawl? Maybe you could try storing the data locally and getting new product data every now and then using a cronjob – Evochrome Aug 21 '18 at 19:14
  • @Evochrome, I'm trying to get the stock which, because it's a distributor, is not available unless I log in (which I managed to do through yet another curl request) - so yep :D, the stock is the only thing that I need, but because it is dynamic... I need to send those requests every day... – emma Aug 21 '18 at 19:17
  • @emma I would then say to use [`cronjobs`](https://stackoverflow.com/questions/26155160/how-can-i-write-a-php-cron-script) as I mentioned earlier. The best ("hacky") practice would then be to crawl the urls at different time points to avoid the 429 error. After that, if you store it in a cookie or database, you should be able to use the data on the spot. If you have resources, you could also try routing the requests via different servers or VPNs, as that might trick the supplier's server. – Evochrome Aug 21 '18 at 21:00
  • @Evochrome, okay :( - I've made it run 10 of them / minute - it takes a lot of time, but a cron job is running the script and then it updates my database, so :-?? I guess it's bad for them :D but if they don't want to invest in an API then it's their fault... right? :D – emma Aug 21 '18 at 21:48
  • Hey Emma, any updates? I also just thought that you might be able to simply use `$page = file_get_contents("example.com?id=".$id)` instead of your curly business ;) – Evochrome Aug 24 '18 at 21:10
  • Hey @Evochrome, yes, the conclusion is that there is no resource-wise way - at least for now - but I've made a little script and I'll post it below :D (as for file_get_contents, I can't use it because there is a redirect on each url, since I only know the id but the url also contains the SEO-friendly title of the products) – emma Aug 25 '18 at 07:53

2 Answers


Have a daemon or cron job that is constantly updating a db 24/7 at a safe pace, and whenever you need instant results, just query the db instead of the actual website. If a safe pace is too slow, just keep adding more IPs (use proxies) until it's at an acceptable pace.
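
As a rough sketch of the proxy part of that idea - the $proxies list below is made up, and the parsing/db write is left out - the rotation could look like this:

//hypothetical pool of proxies to spread the requests over (host:port)
$proxies = ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080'];

foreach($arr as $i => $id){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url . $id);
    curl_setopt($ch, CURLOPT_PROXY, $proxies[$i % count($proxies)]); //rotate through the pool
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    curl_close($ch);

    //parse $result and write it to the local db here, then keep a safe pace
    sleep(1);
}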

hanshenrik
  • Hey @hanshenrik, the thing is that the stock is updated on the distributor's website, not on mine, so to keep that stock updated I needed a way to constantly crawl their website (I will leave the final form below - I'm sure it can be improved, but for now it works, and while I'm learning more about curl I will leave it that way) – emma Aug 25 '18 at 07:19

UPDATE:

First of all, I want to say thank you to all those who answered my question :D

After a few days of reading and trial and error I've reached a conclusion - one I'm not completely satisfied with, so I'll keep searching for a better solution - but for now this is my result:

First, I've added a new column to my table in which I'm saving time(). Based on this, every time the cron job runs the script, I select the next 30 products that were not updated in the past 12h - and I run that cron job every 10 minutes. This is the loop:

//get the ids of 30 products that haven't been updated in the past 12 hours
$now = time() - 43200;
$products = $pdo->query("SELECT id FROM products WHERE last_update < $now LIMIT 30");

$bind = [];

foreach($products as $id){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://shop.com?product=' . $id[0] . '.html');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);

    preg_match_all('!<span class="stock">(.*?)<\/span>!', $response, $data);
    $stock = $data[1][0]; //first captured group = the stock text

    array_push($bind, [$stock, time(), $id[0]]);

    curl_close($ch);
    sleep(2);
}

//then I'm just updating these results through a query
//I'm using PDO to deal with my db
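
For completeness, that update step is basically one prepared statement looped over $bind (I'm assuming a stock column here - only id and last_update appear in the SELECT above, so adapt the column names to your own table):

//roughly how the update looks - the stock column name is an assumption
$update = $pdo->prepare("UPDATE products SET stock = ?, last_update = ? WHERE id = ?");

foreach($bind as $row){
    //$row is [$stock, time(), $id[0]], as built in the loop above
    $update->execute($row);
}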

I think the most important thing here is to sleep the loop on each and every iteration. This way I didn't get the 429, and the execution for 30 products at a time actually happens pretty fast - it takes around 1.5 minutes to complete - but I'm avoiding the too-many-requests problem, and it is run by the cron job, so I don't really have to do anything.

The limitation with this way of doing things is that, by using time(), if you have more products than can fit into a 12-hour cycle, the script will simply start again with the first 30 products that haven't been updated in the past 12 hours - but to solve this "problem" I'm thinking about saving a counter in a db table, so I can use it to start from X every time the script runs and then update it to X + 30 (a rough sketch of that idea is below).
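
Roughly what I have in mind for that counter - just a sketch, the scrape_state table and its next_offset column are made up for the example:

//read the saved position (scrape_state / next_offset are made-up names)
$offset = (int) $pdo->query("SELECT next_offset FROM scrape_state LIMIT 1")->fetchColumn();

//take the next 30 products starting from that position
$products = $pdo->query("SELECT id FROM products ORDER BY id LIMIT $offset, 30");

//move the position forward, wrapping around at the end of the table
$total = (int) $pdo->query("SELECT COUNT(*) FROM products")->fetchColumn();
$next  = ($offset + 30 >= $total) ? 0 : $offset + 30;
$pdo->prepare("UPDATE scrape_state SET next_offset = ?")->execute([$next]);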

Using curl to crawl websites doesn't seem to me like the most resource-wise solution, but it is one that can take the crawling off your hands.

Again, I'm not 100% satisfied with the way I've written this script, but for now it works.

If I ever find a better solution, I'll post it here.

Thank you!

emma
  • `SELECT id FROM products WHERE last_update < $now LIMIT 30` this will get you 30 *random* things that are too old, but you should instead get the 30 *oldest* things that are too old, try `SELECT id FROM products WHERE last_update < $now ORDER BY last_update ASC LIMIT 30` - this will get you the 30 oldest things ^^ – hanshenrik Aug 26 '18 at 08:32
  • and using curl to crawl is fine resource-wise, but you're creating & deleting a curl instance on every iteration; that's a waste of cpu and disables connection keep-alive. It'd be faster if you just re-used the same curl handle over and over until you're done: move curl_init() outside above the foreach loop, and move curl_close() outside below it, and it'll go faster. It will go even faster if you enable CURLOPT_ENCODING – hanshenrik Aug 26 '18 at 08:34
  • also.. you're not urlencoding `$id[0]`, that's probably a bug; try `'http://shop.com?product=' . urlencode($id[0]) . '.html'` instead. And you're parsing [HTML with regex](https://stackoverflow.com/a/1732454/1067003), which is a sin and unreliable; try `$domd=@DOMDocument::loadHTML($response);$xp=new DOMXPath($domd);$stock=$domd->saveHTML($xp->query('//span[@class="stock"]')->item(0));` instead - it should be more reliable in case a `` ever pops up inside of the span you want to parse (a combined sketch of these suggestions follows below) – hanshenrik Aug 26 '18 at 08:44
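
Putting those suggestions together - ORDER BY, one reused curl handle, CURLOPT_ENCODING, urlencode() and DOM parsing instead of a regex - the loop from the answer above could be reworked roughly like this (same products table and `<span class="stock">` markup assumed; loadHTML is called on an instance here rather than statically):

//the 30 *oldest* products that haven't been updated in the past 12 hours
$now = time() - 43200;
$products = $pdo->query("SELECT id FROM products WHERE last_update < $now ORDER BY last_update ASC LIMIT 30");

$bind = [];

//one curl handle, created once and reused for every request (keeps the connection alive)
$ch = curl_init();
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, ''); //accept any compression curl supports

foreach($products as $id){
    curl_setopt($ch, CURLOPT_URL, 'http://shop.com?product=' . urlencode($id[0]) . '.html');
    $response = curl_exec($ch);

    //parse the DOM instead of running a regex over the HTML
    $domd = new DOMDocument();
    @$domd->loadHTML($response); //@ silences warnings about sloppy markup
    $xp = new DOMXPath($domd);
    $node = $xp->query('//span[@class="stock"]')->item(0);
    $stock = $node ? trim($node->textContent) : null;

    $bind[] = [$stock, time(), $id[0]];
    sleep(2);
}

curl_close($ch);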