
I'm trying to GET a large amount of data from the API (over 300k records). It has pagination (25 records per page) and the request limit is 50 requests per 3 minutes. I'm using PHP curl to get the data. The API requires JWT token authorization. I can get a single page and put its records into an array.

...
$response = curl_exec($curl);
curl_close($curl);
$result = json_decode($response, true);

The problem is that I need to get all records from all pages and save them into an array or a file. How can I do that? Or would JS be a better tool for this?

Best regards and thank you.

Matredok
  • I would write the raw data to files and then post-process those files ("files" here just means something persistent; it could also be a record in a database). Make your process restartable (keep track of progress, either via the files you save or via separate state). If the data is changing, prefer querying by a key so you don't miss records added in the middle of your process. – Allan Wind Sep 07 '22 at 22:17
  • @AllanWind OK, but what about the request limit? Do I need cron to fire the script every hour? It's going to take a while to save all the records: 50*25*20 = 25k per hour, so about 12 hours. – Matredok Sep 07 '22 at 22:43
  • I would run it until you hit the request limit (and I would use a long-running process instead of cron). Let the server tell you when you are hitting it too hard. I don't expect that writing the files will slow you down compared to your server limits. If you don't have a copy of the data, however, you pay that price if there is a problem with your processing code (likely). – Allan Wind Sep 08 '22 at 01:35
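
For illustration, here is a minimal sketch of the restartable, file-based approach discussed in the comments above. The endpoint URL, the page/pageSize parameter names and the shape of the JSON response are assumptions and have to be adjusted to the actual API:

    <?php
    // Sketch only: endpoint, parameter names and response layout are assumptions.
    $token     = 'YOUR-JWT-TOKEN';
    $baseUrl   = 'https://api.example.com/records';   // hypothetical endpoint
    $stateFile = __DIR__ . '/last_page.txt';
    $dataDir   = __DIR__ . '/pages';

    if (!is_dir($dataDir)) mkdir($dataDir);

    // Restartable: resume from the last page recorded in the state file.
    $page = file_exists($stateFile) ? (int)file_get_contents($stateFile) + 1 : 1;

    while (true) {
        $curl = curl_init("$baseUrl?page=$page&pageSize=25");
        curl_setopt_array($curl, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER     => ["Authorization: Bearer $token"],
        ]);
        $response = curl_exec($curl);
        $code     = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        curl_close($curl);

        if ($response === false || $code !== 200) break;   // stop (or retry) on errors

        $records = json_decode($response, true);
        if (empty($records)) break;                         // assumed: empty body means no more pages

        // Keep the raw response so it can be post-processed later.
        file_put_contents("$dataDir/page-$page.json", $response);
        file_put_contents($stateFile, $page);

        $page++;
        sleep(4);   // ~45 requests per 3 minutes, safely under the 50-request limit
    }

Because progress is stored on disk, the script can simply be re-run (by hand, via cron or a supervisor) after a crash or a ban and it continues where it left off.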

3 Answers


Ideally use cron and some form of storage, a database or a file.

It is important to ensure that a new call to the script doesn't start unless the previous one has finished; otherwise the calls start stacking up, and after a few you will have server overload, failed scripts, and a mess.

  1. Store a value to say the script is starting.

  2. Run the cURL request.

  3. Once the curl request has returned and the data has been processed and stored, change the value you stored at the beginning to say the script has finished.

  4. Run this script as a cron job at the intervals you deem necessary.

A simplified example:

    <?php

    // Persist the "busy" flag in a file (it could also be a database value).
    $flag_file = __DIR__ . '/script_is_busy.flag';

    // Exit if the previous run hasn't finished yet.
    if (file_exists($flag_file)) exit();

    // Mark the script as busy.
    touch($flag_file);

    // YOUR CURL REQUEST AND PROCESSING HERE

    // Mark the script as finished.
    unlink($flag_file);

    ?>
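
One caveat with a stored flag: if the script crashes between setting and clearing it, every later run will exit immediately. A sketch of a more robust variant, assuming a file-based lock with flock() is acceptable (the lock-file name is arbitrary); the operating system releases the lock when the process ends, even on a crash:

    <?php

    // Alternative sketch: an exclusive, non-blocking lock instead of a flag value.
    $lock = fopen(__DIR__ . '/fetch.lock', 'c');

    if (!flock($lock, LOCK_EX | LOCK_NB)) {
        exit("Previous run still in progress\n");
    }

    // YOUR CURL REQUEST AND PROCESSING HERE

    flock($lock, LOCK_UN);
    fclose($lock);

    ?>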
kissumisha
  • Why do you think database calls would build up? Then how is the cron job process going to help? Someone makes a request for records. So you only return a portion of the requested records. After 3 minutes have gone by and you can send more records, where are you going to send them? The record limitation will likely fix the problem. No one will want to use the service with the limitations. – Misunderstood Sep 08 '22 at 02:37
  • It depends on the query being made; sometimes there are long response times, or there are big chunks of data to be transferred. It's a bad idea to set large timeouts in PHP, hence it is better to re-run the script rather than wait the 3 minutes to do another request. There are many approaches to the same problem; this is one of many. – kissumisha Sep 09 '22 at 09:45

Okay. I misinterpreted what you needed. I have more questions.

  • Can you do one request and get your 50 records immediately? (That is assuming that when you said 50 requests per 3 minutes you meant 50 records.)
  • Why do you think there is this 50/3 limitation?
  • Can you provide a link to this service?
  • Is that 50 records per IP address?
  • Is leasing 5 or 6 IP addresses an option?
  • Do you pay for each record?
  • How many records does this service have in total?
  • Do the records have a time limit on their viability?

I am thinking that if you can use 6 IP addresses (or 6 processes), you can run the 6 requests simultaneously using stream_socket_client().
stream_socket_client() allows you to make simultaneous requests.
You then create a loop that monitors each socket for a response.
About 10 years ago I made an app that evaluated web page quality. I ran:

  • W3C Markup Validation
  • W3C CSS Validation
  • W3C Mobile OK
  • WebPageTest
  • My own performance test.

I put all the URLs in an array like this:

   $urls = array();
   $path = $url;                // keep the original URL
   $url  = urlencode($url);     // URL-encoded version used in the query strings
   $urls[] = array('host' => "jigsaw.w3.org", 'path' => "/css-validator/validator?uri=$url&profile=css3&usermedium=all&warning=no&lang=en&output=text");
   $urls[] = array('host' => "validator.w3.org", 'path' => "/check?uri=$url&charset=%28detect+automatically%29&doctype=Inline&group=0&output=json");
   $urls[] = array('host' => "validator.w3.org", 'path' => "/check?uri=$url&charset=%28detect+automatically%29&doctype=XHTML+Basic+1.1&group=0&output=json");

Then I'd make the sockets.

  $sockets = array();
  $start = array();
  $err = '';

  foreach ($urls as $id => $request) {
    $host = $request['host'];
    $path = $request['path'];
    $http = "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
    $stream = stream_socket_client("$host:80", $errno, $errstr, 120, STREAM_CLIENT_ASYNC_CONNECT|STREAM_CLIENT_CONNECT);
    if ($stream) {
      $sockets[$id] = $stream;        // supports multiple sockets
      $start[$id] = microtime(true);  // remember when the request was sent
      fwrite($stream, $http);         // send the raw HTTP request
    }
    else {
      $err .= "$id Failed<br>\n";
    }
  }

Then I monitored the sockets and retrieved the response from each socket.

$write = NULL;
$except = NULL;
$timeout = 120;        // seconds to wait for socket activity (assumed)
$buffer_size = 8192;   // bytes to read per call (assumed)
$result = array();
$closed = array();

while (count($sockets)) {
  $read = $sockets;
  stream_select($read, $write, $except, $timeout);
  if (count($read)) {
    foreach ($read as $r) {
      $id = array_search($r, $sockets);
      $data = fread($r, $buffer_size);
      if (strlen($data) == 0) {
        // echo "$id Closed: " . date('h:i:s') . "\n\n\n";
        $closed[$id] = microtime(true);
        fclose($r);
        unset($sockets[$id]);
      }
      else {
        if (!isset($result[$id])) $result[$id] = '';
        $result[$id] .= $data;
      }
    }
  }
  else {
    // echo 'Timeout: ' . date('h:i:s') . "\n\n\n";
    break;
  }
}

I used it for years and it never failed.
It would be easy to gather the records and paginate them.
After all sockets are closed you can gather the pages and send them to your user.

Do you think the above is viable?



JS is not better.

Or did you mean 50 records each 3 minutes?

This is how I would do the pagination.
I'd organize the response into pages of 25 records per page.
In the query results while loop I'd do this:

$cnt = 0;
$page = 0;
$response = array();
while (...) {                               // loop over the query results
    $cnt++;
    $response[$page][] = $record;
    if ($cnt > 24) { $page++; $cnt = 0; }   // start a new page after 25 records
}
header('Content-Type: application/json');
echo json_encode($response);
Misunderstood
  • @Misunderstood thank you for your response :) My problem is how to return all records (once, to save them into a file for example) without overloading the server. If I can do it without cron, then how? Can I just run the function with my request in a for loop 1k times? How do I not cross the API limitations? – Matredok Sep 08 '22 at 07:17
  • I do not understand your reasoning behind imposing the limitations. The limitations have become your problem, while I doubt overloading the server is a problem. The way to minimize the load on the server is to give the user all the records they request in one response. The limitations only increase the server load. Where you could complete the response in one transaction, the self-imposed limitations make things too complex and use more server resources. You are shooting yourself in the foot with the limitations. It's 4:00 am and I should get some sleep. I did update my answer. – Misunderstood Sep 08 '22 at 07:58
  • The API documentation says: The API interface has a built-in query limit. There is a limit of 50 requests in a 3-minute period. In one request I can get only 25 records, otherwise I get an error: "The page size should be in the range 1-25". So how can I get all records in one request? :) – Matredok Sep 08 '22 at 08:07
  • @Matredok I'm sorry, I misunderstood the situation. I thought it was your API, not a third party's. I thought you were imposing the limitations on your users rather than someone else imposing them on you. It's late. We'll figure it out. Tomorrow. Well, later today actually. – Misunderstood Sep 08 '22 at 08:14
  • no problem :) Thank you very much! I'll be waiting for you! Goodnight. – Matredok Sep 08 '22 at 08:17
  • I'm sorry. Today was not a good day for me. I have not forgotten about you. – Misunderstood Sep 09 '22 at 07:32
  • Try and discover if there is a way of getting around it. Example, if the service is blocking the token, create a new token and see if you can fetch more data. – James Wagstaff Sep 09 '22 at 21:18
  • @JamesWagstaff You are correct. When there is an obstacle, find a way around it. This one is getting frustrating. There are so many possibilities to solve this one. But I need answers. And I am not getting them. – Misunderstood Sep 10 '22 at 05:27
  • @Misunderstood Wow, thank you for your involvement :) 1. Yes 2. The API's documentation says so. 3. The docs are in Polish :P 4. It's 50 records per token I think 5. I pay for a token. 6. It's about 5.5m 7. Idk :P – Matredok Sep 10 '22 at 19:02

I would use a series of requests. A typical request takes at most 2 seconds to fulfill, so 50 requests per 300 seconds does not require parallel requests (and even so, curl supports parallelism as far as I remember). Still, you need to measure time and wait if you don't want to be banned for DoS: when you reach the request limit you must use the sleep function to wait until you can send new requests. For PHP the real problem is that this is a long-running job, so you need to change settings, otherwise the script will time out. You can handle it this way: Best way to manage long-running php script? As for Node.js, I think it is a much better fit for this kind of async task, because the required features come naturally with Node.js without extensions and such, though I am biased towards it.
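
For illustration, a minimal sketch of the timing part of that approach: count the requests and, once the 50-per-3-minutes window is used up, sleep for whatever remains of the window. $totalPages and the request itself are placeholders:

    <?php
    // Sketch only: shows the limit handling, not the actual curl request.
    set_time_limit(0);   // long-running job: disable PHP's execution time limit

    $windowStart = microtime(true);
    $requests    = 0;

    for ($page = 1; $page <= $totalPages; $page++) {   // $totalPages assumed to be known
        // ... perform the curl request for $page here and store the result ...

        if (++$requests >= 50) {                        // 50 requests per 3-minute window
            $elapsed = microtime(true) - $windowStart;
            if ($elapsed < 180) {
                sleep((int)ceil(180 - $elapsed));       // wait out the rest of the window
            }
            $windowStart = microtime(true);
            $requests    = 0;
        }
    }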

inf3rno