
I have the following dataset (this is just 5 observations out of 270K):

ride_id_hash  start_lat  start_lon    end_lat    end_lon
0  3913897986482697189  52.514532  13.389043  52.513640  13.380389
1  -742299897365889600  52.515073  13.394545  52.518483  13.407965
2  -319522908151650562  52.528291  13.409570  52.526965  13.415183
3 -3247332391891858034  52.514276  13.453460  52.503649  13.448538
4  5368542961279891088  52.504448  13.468405  52.511891  13.474998

I would like to make requests to the Mapbox Directions API and retrieve the estimated distance and duration for each longitude-latitude origin-destination pair. As I have to do this for 270K rows, the process is very slow. I coded the following:

import time

import requests

access_token = 'pk.eyXXXX'

def create_route_json(row):
    """Get estimated duration and distance for one origin-destination pair."""
    base_url = 'https://api.mapbox.com/directions/v5/mapbox/cycling/'
    # Mapbox expects lon,lat order; column names match the dataset above
    url = (f"{base_url}{row['start_lon']},{row['start_lat']};"
           f"{row['end_lon']},{row['end_lat']}")
    print(f"start: ({row['start_lat']}, {row['start_lon']}) "
          f"end: ({row['end_lat']}, {row['end_lon']})")
    params = {
        'geometries': 'geojson',
        'access_token': access_token
    }

    req = requests.get(url, params=params)
    route_json = req.json()['routes'][0]
    print('Duration: ' + str(route_json['duration']))
    print('Distance: ' + str(route_json['distance']))
    print('^' * 40)
    row['Estimated_duration'] = route_json['duration']
    row['Estimated_distance'] = route_json['distance']
    return row

start = time.time()
df_process = df_process.apply(create_route_json, axis=1)
print(f'Elapsed: {time.time() - start:.1f} s')

I am wondering if there is a way to make 270K calls quickly. (To reproduce this you need your own access token for the Mapbox Directions API.)
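
One speedup worth trying regardless of any parallelism (an assumption on my part, not something benchmarked against Mapbox): reuse a single requests.Session so all calls share one keep-alive HTTPS connection instead of paying a new TCP/TLS handshake per request. A minimal sketch:

import requests

# one shared Session: connections are pooled and kept alive across calls
session = requests.Session()

def get_directions(url, params):
    """Drop-in replacement for requests.get inside create_route_json."""
    return session.get(url, params=params)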

sanchezjAI
    Usually the remote server will implement rate limiting, so it -- not whatever code you run on your side -- will be the limiting factor. – Charles Duffy Dec 21 '20 at 16:35
  • And frankly, if I were running a web service and someone put it under that kind of load, I'd be cutting off their API token unless they'd made special arrangements and paid for the relevant level of service (potentially including a dedicated node/endpoint to avoid disrupting other clients). – Charles Duffy Dec 21 '20 at 16:37
  • ...by contrast, if they have a batch API that lets you submit 270K calls at once and get back to you later when they have a response, you'll need to find the documentation for that API. If they don't have one, we're back in the place of needing to ask questions of the people running the service. – Charles Duffy Dec 21 '20 at 16:38
  • Anyhow: Is it possible to parallelize and run N requests at a time (for any number N that you want)? Sure, lots of ways, most of them already covered in questions asked and answered here at Stack Overflow (see e.g. the discussions of thread pools, multiprocessing, async HTTP client frameworks, etc.). But get too aggressive about it, and you can expect Mapbox to cut you off (or they may well have a limit on how many requests they'll process at a time per API key in the first place). – Charles Duffy Dec 21 '20 at 16:40
  • [What is the fastest way to send 100,000 http requests in Python?](https://stackoverflow.com/questions/2632520/what-is-the-fastest-way-to-send-100-000-http-requests-in-python) is one such pre-answered question you might start with if your goal is to have Mapbox cut you off for abusing their service. – Charles Duffy Dec 21 '20 at 16:41
  • @CharlesDuffy You are right, I did not take that into account. There is a rate limit of 300 requests per minute, with up to 25 coordinates per request. So I am thinking of doing 300*25 = 7,500 observations/min; with 270K rows, that would take around 36 minutes to compute. Is there any way to wait until the minute is up and then run again? – sanchezjAI Dec 21 '20 at 19:02
  • When it comes to rate limiting, I'm personally a big fan of [token bucket](https://en.wikipedia.org/wiki/Token_bucket) algorithms -- that way you can put a cap in place but let things run faster until they catch up to that cap. Anyhow -- if you use a library like https://github.com/falconry/token-bucket/ you can have a thread pool that runs at full steam as long as there are tokens left in the bucket, and then slows down when they run out. – Charles Duffy Dec 21 '20 at 21:22
  • Anyhow -- personally, I'd implement this with a thread pool, one thread per allowed concurrent request, with each thread needing to grab a token from the bucket before it actually goes through with connecting to the remote server. Batch your coordinates into sets of 25, put each batch of 25 coordinates into a [queue](https://docs.python.org/3/library/queue.html), have the threads in the pool pull from that queue, and there you are (a sketch along these lines follows after this list). – Charles Duffy Dec 21 '20 at 21:46
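
Putting the last three comments together, a hedged sketch of that design: a hand-rolled token bucket (standing in for the linked token-bucket library, to avoid depending on its exact API) capped at 300 requests/min, drained by a ThreadPoolExecutor whose internal work queue plays the role of the queue described above. The worker name fetch_route, the pool size of 10, and the one-pair-per-request granularity are illustrative assumptions, not Mapbox-verified settings:

import threading
import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

class TokenBucket:
    """Hand-rolled token bucket: refills at `rate` tokens/sec up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def take(self):
        """Block until a token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)  # bucket is empty; wait for the refill

bucket = TokenBucket(rate=300 / 60, capacity=300)  # 300 requests per minute
session = requests.Session()

def fetch_route(row):
    """Hypothetical worker: one Directions request per origin-destination pair."""
    bucket.take()  # grab a token before touching the network
    base_url = 'https://api.mapbox.com/directions/v5/mapbox/cycling/'
    url = (f"{base_url}{row['start_lon']},{row['start_lat']};"
           f"{row['end_lon']},{row['end_lat']}")
    params = {'geometries': 'geojson', 'access_token': access_token}
    route = session.get(url, params=params).json()['routes'][0]
    return route['duration'], route['distance']

# ThreadPoolExecutor feeds its workers from an internal queue,
# and pool.map preserves the input order of the rows.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch_route, df_process.to_dict('records')))

df_process[['Estimated_duration', 'Estimated_distance']] = pd.DataFrame(
    results, index=df_process.index)

Note that this sketch sends one coordinate pair per request, so it runs at 300 pairs/min; to approach the 7,500 observations/min figure computed in the comments you would still need to group pairs into 25-coordinate batches, which changes the URL construction but not the bucket-and-pool structure.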

0 Answers