We plan to use the SEMrush API, which provides access to SEO data relating to domain names and search keywords. Under their Terms of Use, they limit usage to avoid overloading their servers:

You may not perform more than 10 requests per second, nor more than 2 simultaneous requests.

We are going to build a simple tool in PHP that aggregates data based on a domain name, and we are looking for the basics of how to comply with that requirement. We are planning for hundreds or thousands of potential simultaneous users.

Maybe someone can provide some pseudocode in PHP that would let us do this - or is it really as simple as forcing the actual API request function to sleep for a second between each call? I don't have a lot of experience with APIs or large numbers of concurrent users, so any help is appreciated.

Jared Eitnier
  • I should think you will need to offload the call to this API to a cron. If you do calls in your web process then there is a good chance (if your site is busy) that you will break the usage limits. So in your cron process, yes - do a `usleep` for an appropriate length of time. 0.1 seconds should do it, since you'll presumably have to do some pre- and post-processing of API calls anyway. – halfer Dec 10 '12 at 22:18
  • I say cron, but a job queue may be more appropriate; something like Gearman maybe. However, don't worry about this to start with, just get your calls working first! And don't optimise too early - get stuff working quickly, and then optimise from there. – halfer Dec 10 '12 at 22:26
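For reference, a minimal sketch of the sequential cron-worker approach described in the comments above; the `fetch_pending_domains()`, `call_semrush()` and `store_result()` helpers are hypothetical placeholders for your own queue lookup, API call and persistence:

    // cron_worker.php - run periodically (e.g. every minute) via cron.
    // Processes queued domains one at a time, pausing between calls so
    // this single worker never exceeds 10 requests per second.
    foreach (fetch_pending_domains() as $domain) {   // hypothetical queue lookup
        $data = call_semrush($domain);               // hypothetical API wrapper
        store_result($domain, $data);                // hypothetical persistence
        usleep(100000); // 100 ms pause keeps us at <= 10 requests/second
    }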

2 Answers

PHP is really not the best language to use for concurrent programming. However, there are some third-party solutions that you can use alongside PHP to help you achieve your goals.

What you need is a job manager or a queue system that can handle the actual requests for you. Since this is a back-end tool (at least that's what I gathered from your question), PHP doesn't need to control the jobs itself; instead, a controlling process can schedule the individual jobs and hand them to your PHP scripts, so that the limits can be imposed centrally.

My first suggestion would be to try something like Gearman, which is a great job manager and has a PHP extension to help you interface with the library.

Another suggestion is to take a look at message-queue systems like AMQP or ZeroMQ, both of which also have PHP extensions.

So here's an example scenario for you...

You have a PHP script that accepts these requests and hands them off to your job manager or queue over a socket. The job manager or queue will store the request and distribute it to individual workers in a way that can be centralized and controlled to impose these limits. There are some examples in the links I gave you that can help you get there. However, doing it purely in PHP without the aid of these tools will prove quite tricky and could result in some very buggy edge-case behavior if not carefully crafted and considered.
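As a rough illustration of that flow, here is a minimal sketch using the pecl/gearman extension; the task name `fetch_semrush`, the server address, and the worker pacing are illustrative assumptions, not SEMrush-specific values:

    // client.php - the web-facing script queues a job and returns immediately.
    $client = new GearmanClient();
    $client->addServer('127.0.0.1', 4730);   // default gearmand port
    $client->doBackground('fetch_semrush', json_encode(['domain' => 'example.com']));

    // worker.php - run a fixed, small number of these (e.g. 2, matching the
    // "2 simultaneous requests" limit), each pacing itself between jobs.
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1', 4730);
    $worker->addFunction('fetch_semrush', function (GearmanJob $job) {
        $params = json_decode($job->workload(), true);
        // ... perform the actual SEMrush API call here ...
        usleep(200000); // 200 ms pause: 2 workers * 5 req/s = 10 req/s overall
    });
    while ($worker->work());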

Sherif
  • I am checking out gearman now. Since I'm very new to process management, I would be looking for a decent tutorial if you know one, aside from the documentation that gearman directly provides. Thank you for this answer. – Jared Eitnier Dec 11 '12 at 14:38

Some APIs return rate limit information in their response headers. Check out: Examples of HTTP API Rate Limiting HTTP Response headers. This information will tell you exactly how long to wait before continuing with your next request, using PHP's time_nanosleep().
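A sketch of reading such headers with cURL; the header names (X-RateLimit-Remaining here) vary per API and are an assumption - check what your provider actually sends:

    // Hypothetical endpoint; CURLOPT_HEADER includes headers in the output.
    $ch = curl_init('https://api.example.com/endpoint');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, true);
    $response   = curl_exec($ch);
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    // Parse the remaining-quota header and back off if we've run out.
    $headers = substr($response, 0, $headerSize);
    if (preg_match('/^X-RateLimit-Remaining:\s*(\d+)/mi', $headers, $m) && (int)$m[1] === 0) {
        time_nanosleep(1, 0); // out of quota: wait (here, a flat 1 second)
    }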

Some PHP libraries go pretty in-depth with their ways of rate limiting. The token bucket algorithm is pretty common across the web: https://github.com/bandwidth-throttle/token-bucket
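For reference, a sketch of that library's documented usage, sized to SEMrush's 10 requests/second; the bucket file path is an assumption, and the class names follow the project's README so they may differ between versions:

    use bandwidthThrottle\tokenBucket\Rate;
    use bandwidthThrottle\tokenBucket\TokenBucket;
    use bandwidthThrottle\tokenBucket\storage\FileStorage;

    // Shared file storage lets concurrent PHP processes draw from one bucket.
    $storage = new FileStorage(sys_get_temp_dir() . '/semrush.bucket');
    $rate    = new Rate(10, Rate::SECOND);  // refill 10 tokens per second
    $bucket  = new TokenBucket(10, $rate, $storage);
    $bucket->bootstrap(10);                 // set initial capacity (run once)

    while (!$bucket->consume(1, $seconds)) {
        usleep((int)ceil($seconds * 1000000)); // wait until a token is available
    }
    // ... perform the API request here ...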

Now I find this a bit overkill when it comes down to throttling some URL requests that don't include something like X-RateLimit-Remaining in their response headers. API requests in general are usually pretty slow anyway. So I've built the PHP script below.

This PHP script will just wait for a number of milliseconds based on a $throttlerID. A higher $requestsInSeconds will result in shorter wait times. If the same $throttlerID is used across simultaneous requests, each request will wait for the others using file locking (flock()).

    function Throttler($requestsInSeconds, $throttlerID) {

        // Use flock() to create a system-wide lock (it's crash-safe: the
        // lock is released automatically if the process dies).
        $fp = fopen(sys_get_temp_dir() . "/$throttlerID", "w+");
        if ($fp === false) {
            return; // could not open the lock file
        }

        // An exclusive lock blocks until it is obtained, so simultaneous
        // callers with the same $throttlerID queue up behind each other.
        if (flock($fp, LOCK_EX)) {

            // Sleep for 1/$requestsInSeconds seconds
            // ($requestsInSeconds should be 1 or higher).
            $time_to_sleep = (int)(999999999 / $requestsInSeconds);
            time_nanosleep(0, $time_to_sleep);

            flock($fp, LOCK_UN); // unlock
        }

        fclose($fp);
    }

Put the call to Throttler() right before each cURL call. That's it!
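For example (the "semrush" ID string is arbitrary, and 10 matches SEMrush's per-second limit; the URL is a placeholder for your actual API call):

    Throttler(10, "semrush"); // every script calling this API uses the same ID
    $ch = curl_init("https://api.example.com/?...");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    curl_close($ch);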

Jay