
I have to retrieve a large dataset from a web API (NCBI Entrez) that limits me to a certain number of requests per second, say 10 (three if you run the example code without an API key). I'm using furrr's future_* functions to parallelize the requests so I can get the data as quickly as possible, like this:

library(tidyverse)
library(rentrez)
library(furrr)

plan(multiprocess)

api_key <- "<api key>"
# this will return a crap-ton of results
srch <- entrez_search("nuccore", "Homo sapiens", use_history=T, api_key=api_key)

total <- srch$count
per_request <- 500 # get 500 records per parallel request
nrequest <- total %/% per_request + as.logical(total %% per_request)

result <- future_map(seq(nrequest),function(x) {
  rstart <- (x - 1) * per_request
  return(entrez_fetch(
    "nuccore",
    web_history = srch$web_history,
    rettype="fasta",
    retmode="xml",
    retstart=rstart,
    retmax=per_request,
    api_key=api_key
  ))
})

Obviously for cases where nrequest > 10 (or whatever the limit is), we will immediately run afoul of the rate limiting.

I see two seemingly obvious simple solutions to this, both of which seem to work. One is to introduce a random short delay before making the request, like so:

future_map(seq(nrequest),function(x) {
  Sys.sleep(runif(1,0,5))
  # ...do the request...
})

The second is to limit the number of concurrent requests to the rate limit, either via plan(multiprocess, workers=<max_concurrent_requests>) or by using the semaphore package with the semaphore count set to the rate limit. The semaphore version looks like this (a sketch of the worker-capped variant follows it):

# this assumes individual requests take long enough that waiting on the
# semaphore introduces enough of a delay between requests
# for this case, they do
rate_limit <- 10
lock <- semaphore(rate_limit)
result <- future_map(seq(nrequest),function(x) {
  rstart <- (x - 1) * per_request
  acquire(lock)
  s <- entrez_fetch(
    "nuccore",
    web_history = srch$web_history,
    rettype="fasta",
    retmode="xml",
    retstart=rstart,
    retmax=per_request,
    api_key=api_key
  )
  release(lock)
  return(s)
})
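
For reference, the worker-capped variant would just be the original future_map call run under a plan that limits the number of workers to the rate limit. This is only a sketch, reusing srch, per_request, nrequest, and api_key from the setup above:

rate_limit <- 10
plan(multiprocess, workers = rate_limit)  # at most rate_limit requests in flight at once

result <- future_map(seq(nrequest), function(x) {
  rstart <- (x - 1) * per_request
  entrez_fetch(
    "nuccore",
    web_history = srch$web_history,
    rettype = "fasta",
    retmode = "xml",
    retstart = rstart,
    retmax = per_request,
    api_key = api_key
  )
})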

However, what I would really like to be able to do is limit the request rate rather than the number of concurrent requests. There's a great post by Quentin Pradet on how to do this for asynchronous HTTP requests in Python. I made an attempt to adapt it to R, but ran into the problem that any variable shared across processes in the future_* function is copied rather than actually shared, so modifications (even if protected by a semaphore lock) are not visible to the other processes, and it's therefore not possible to implement the counter bucket this method relies on!
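
To make the copying problem concrete, here is a minimal sketch (the tokens variable and the worker count are just placeholders, not my real attempt) showing that a counter decremented inside future_map is never seen by the other workers or by the parent session:

library(furrr)
plan(multisession, workers = 2)

tokens <- 10  # intended to be a shared bucket of request tokens

future_map(1:4, function(i) {
  tokens <<- tokens - 1  # only decrements this worker's private copy
  tokens
})

tokens  # still 10 in the parent session: no decrement was shared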

Is there a clever way to rate-limit parallel requests without necessarily capping the number of simultaneous requests? Or am I overthinking this and should just stick to limiting the number?
