I have to retrieve a large dataset from a web API (NCBI Entrez) that limits me to a certain number of requests per second, say 10 (the example code will limit you to three without an API key). I'm using furrr's future_* functions to parallelize the requests and fetch the data as quickly as possible, like this:
library(tidyverse)
library(rentrez)
library(furrr)
plan(multiprocess)

api_key <- "<api key>"

# this will return a crap-ton of results
srch <- entrez_search("nuccore", "Homo sapiens", use_history = TRUE, api_key = api_key)

total <- srch$count
per_request <- 500  # get 500 records per parallel request
nrequest <- total %/% per_request + as.logical(total %% per_request)  # round up

result <- future_map(seq(nrequest), function(x) {
  rstart <- (x - 1) * per_request
  entrez_fetch(
    "nuccore",
    web_history = srch$web_history,
    rettype = "fasta",
    retmode = "xml",
    retstart = rstart,
    retmax = per_request,
    api_key = api_key
  )
})
Obviously, for cases where nrequest > 10 (or whatever the limit is), we will immediately run afoul of the rate limiting.
I see two obvious, simple solutions to this, both of which seem to work. One is to introduce a short random delay before making each request, like so:
future_map(seq(nrequest), function(x) {
  Sys.sleep(runif(1, 0, 5))  # random delay of 0-5 seconds before the request
  # ...do the request...
})
The second is to limit the number of concurrent requests to the rate limit, either with plan(multiprocess, workers = <max_concurrent_requests>) (a sketch of that variant follows the semaphore example below), or by using the semaphore package with a semaphore set to the rate limit, like this:
# this sort of assumes individual requests take long enough that
# waiting on the semaphore actually throttles the rate -- for this case, they do
rate_limit <- 10
lock <- semaphore(rate_limit)

result <- future_map(seq(nrequest), function(x) {
  rstart <- (x - 1) * per_request
  acquire(lock)  # block until one of the rate_limit slots is free
  s <- entrez_fetch(
    "nuccore",
    web_history = srch$web_history,
    rettype = "fasta",
    retmode = "xml",
    retstart = rstart,
    retmax = per_request,
    api_key = api_key
  )
  release(lock)
  s
})
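For completeness, the worker-capped variant is just the same map run under a plan whose workers argument equals the rate limit. A minimal sketch -- I've used multisession here, which newer versions of future prefer over the deprecated multiprocess:

# cap concurrency at the rate limit by only giving furrr that many workers
rate_limit <- 10
plan(multisession, workers = rate_limit)

result <- future_map(seq(nrequest), function(x) {
  rstart <- (x - 1) * per_request
  entrez_fetch(
    "nuccore",
    web_history = srch$web_history,
    rettype = "fasta",
    retmode = "xml",
    retstart = rstart,
    retmax = per_request,
    api_key = api_key
  )
})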
However, what I would really like to do is limit the request rate rather than the number of concurrent requests. There's a great post by Quentin Pradet on how to do this using async IO HTTP requests in Python. I made an attempt to adapt it to R, but ran into the problem that any variable shared across threads/processes in the future_* function is copied rather than actually shared, so modifications (even if protected by a semaphore lock) are not visible to the other threads/processes. That makes it impossible to implement the counter bucket this method relies on!
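To make that concrete, here is roughly what the attempt looks like -- a minimal sketch of the token-bucket idea with a made-up take_token() helper (the names and the 0.05 s poll interval are just for illustration). It behaves correctly in a single process, but under future_map every worker receives its own copy of the bucket, so the refill/consume bookkeeping is never shared:

# a token bucket kept in an environment -- works fine sequentially,
# but under future_map each worker receives a *copy* of `bucket`
rate_limit <- 10
bucket <- new.env()
bucket$tokens <- rate_limit
bucket$updated_at <- Sys.time()

take_token <- function(bucket, rate_limit) {
  repeat {
    now <- Sys.time()
    elapsed <- as.numeric(now - bucket$updated_at, units = "secs")
    # refill in proportion to elapsed time, capped at the bucket size
    bucket$tokens <- min(rate_limit, bucket$tokens + elapsed * rate_limit)
    bucket$updated_at <- now
    if (bucket$tokens >= 1) {
      bucket$tokens <- bucket$tokens - 1
      return(invisible(TRUE))
    }
    Sys.sleep(0.05)  # wait for the bucket to refill
  }
}

result <- future_map(seq(nrequest), function(x) {
  take_token(bucket, rate_limit)  # each worker only throttles against its own copy
  # ...do the request as above...
})

In other words, each worker starts with a full copy of the bucket, so the effective limit becomes rate_limit requests per second per worker rather than overall.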
Is there a clever way to rate-limit parallel requests without necessarily capping the number of simultaneous requests? Or am I overthinking this and should just stick to limiting the number?