
I am facing an issue when submitting multiple parallel requests per second from a Pandas dataframe to a local web server (GraphHopper, which I use to retrieve route information given two points as coordinates).

My dataframe, which has ~1 million rows, has this schema (just to give an idea):

id  lat1  lon1  lat2  lon2

I studied a bit and found the Dask Python library, which seemed to do what I was looking for. This is a reduced version of the code I am using:

import requests
import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count

def graphhopper(lat1, lon1, lat2, lon2):
    # One GET request against the local GraphHopper routing endpoint
    res = requests.get(f"http://localhost:8989/route?point={lat1},{lon1}&point={lat2},{lon2}")
    data = res.json()
    return data['paths'][0]['heading']

df = pd.read_csv('test.csv')
nCores = cpu_count()
ddf = dd.from_pandas(df, npartitions=nCores)
# One request per row; meta tells Dask the result column will be float64
ddf['result'] = ddf.apply(lambda x: graphhopper(x.location_raw_lat, x.location_raw_lon, x.lat_lead, x.lon_lead), meta=(None, 'float64'), axis=1)
new_df = ddf.compute(scheduler='threads')

This actually manages to make 4 parallel requests per second (my machine has 4 cores), which looked reasonable.

What I don't understand is: why, when I partition the starting dataframe into more chunks (e.g. dd.from_pandas(df, npartitions=100)), am I not able to parallelize further? It still submits only 4 requests per second. Of course, this isn't acceptable: at 4 requests per second, ~1 million requests would take roughly 70 hours. How can I increase the number of requests per second?
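For example, based on my reading of the Dask docs the threaded scheduler seems to accept a num_workers argument. Is something like this the intended way to raise the number of concurrent requests (just a sketch reusing the graphhopper function from above; I am not sure num_workers is the right knob)?

# Sketch only: same apply as above (same imports and graphhopper function),
# but asking the threaded scheduler for more threads than cores, since each
# task is an I/O-bound HTTP call.
ddf = dd.from_pandas(df, npartitions=100)
ddf['result'] = ddf.apply(lambda x: graphhopper(x.location_raw_lat, x.location_raw_lon, x.lat_lead, x.lon_lead),
                          meta=(None, 'float64'), axis=1)
new_df = ddf.compute(scheduler='threads', num_workers=32)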

What about Spark? I tried to use PySpark in local mode, but performance was even worse (1 request per second) with this simple configuration:

from pyspark.sql import SparkSession

spark = SparkSession \
  .builder \
  .appName("Test") \
  .config("spark.sql.warehouse.dir", "C:\\temp\\hive") \
  .config('spark.sql.session.timeZone', 'UTC') \
  .getOrCreate()

I am aware that something could be missing, but I don't know whether it makes sense to specify executors and their memory in local mode (I am actually pretty new to Spark, so it's possible that I have misunderstood something).
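To make it concrete, this is the kind of tuning I mean (just a sketch; I don't know whether these settings actually matter when everything runs inside a single driver JVM in local mode):

from pyspark.sql import SparkSession

# Sketch only: local[16] asks for 16 worker threads in the single local JVM,
# and spark.driver.memory sizes that JVM. As far as I understand, executor
# settings would be ignored in local mode.
spark = SparkSession \
  .builder \
  .appName("Test") \
  .master("local[16]") \
  .config("spark.driver.memory", "4g") \
  .getOrCreate()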

More in general, the goal is to add a column with the result of a generic REST API call (it could be the Google APIs or whatever). How can I submit multiple requests for big dataframes so that I can fulfil this task? Any help with this approach, or a suggestion for an alternative one?
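For example, one alternative I have been wondering about is taking the request part out of Dask/Spark entirely and using asyncio with aiohttp (aiohttp is just an assumption on my part, I have not tested this), then writing the results back into the dataframe. A rough sketch of what I have in mind:

import asyncio
import aiohttp
import pandas as pd

async def fetch_heading(session, sem, row):
    # Sketch only: same GraphHopper call as above, but non-blocking, with a
    # semaphore capping the number of in-flight requests.
    url = (f"http://localhost:8989/route?point={row.location_raw_lat},{row.location_raw_lon}"
           f"&point={row.lat_lead},{row.lon_lead}")
    async with sem:
        async with session.get(url) as res:
            data = await res.json()
            return data['paths'][0]['heading']

async def fetch_all(df, max_in_flight=100):
    sem = asyncio.Semaphore(max_in_flight)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_heading(session, sem, row) for row in df.itertuples()]
        return await asyncio.gather(*tasks)

df = pd.read_csv('test.csv')
df['result'] = asyncio.run(fetch_all(df))

Thanks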

  • You will be able to control number of requests more easily, if you separate your data retrieval steps with your data partitioning/processing steps. To increase the number of requests for data retrieval, you can use an approach like this https://stackoverflow.com/questions/43448042/parallel-post-requests-using-multiprocessing-and-requests-in-python – cylim Nov 18 '19 at 01:21
  • I tried your approach, but I am still getting at most 4 requests per second. I can't deal with this amount of data with these timings. Any other solution? – marks Nov 18 '19 at 13:19

0 Answers