
I am facing an issue when submitting multiple parallel requests per second from a Pandas dataframe to a local web server (GraphHopper, which I use to retrieve route information given two points as coordinates).

My dataframe, which has ~1 million rows, has this schema (just to give an idea):

id  lat1  lon1  lat2  lon2

I studied a bit and found the Dask Python library, which seemed to do what I was looking for. This is a reduced version of the code I am using:

import requests
import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count

def graphhopper(lat1, lon1, lat2, lon2):
    # One GET request against the local GraphHopper routing endpoint
    res = requests.get(f"http://localhost:8989/route?point={lat1},{lon1}&point={lat2},{lon2}")
    data = res.json()
    return data['paths'][0]['heading']

df = pd.read_csv('test.csv')
nCores = cpu_count()
ddf = dd.from_pandas(df, npartitions=nCores)
# One request per row; meta tells Dask the result column will be float64
ddf['result'] = ddf.apply(lambda x: graphhopper(x.location_raw_lat, x.location_raw_lon, x.lat_lead, x.lon_lead), meta=(None, 'float64'), axis=1)
new_df = ddf.compute(scheduler='threads')

This actually manages to make 4 parallel requests per second (my machine has 4 cores), which looked reasonable.

What I don't understand is: why, when I partition the starting dataframe into more chunks (e.g. dd.from_pandas(df, npartitions=100)), am I not able to parallelize further? It still submits only 4 requests per second. Of course, this isn't acceptable: at 4 requests per second, ~1 million requests would take roughly 70 hours. How can I increase the number of requests per second?
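For example, based on my reading of the Dask docs the threaded scheduler seems to accept a num_workers argument. Is something like this the intended way to raise the number of concurrent requests (just a sketch reusing the graphhopper function from above; I am not sure num_workers is the right knob)?

# Sketch only: same apply as above (same imports and graphhopper function),
# but asking the threaded scheduler for more threads than cores, since each
# task is an I/O-bound HTTP call.
ddf = dd.from_pandas(df, npartitions=100)
ddf['result'] = ddf.apply(lambda x: graphhopper(x.location_raw_lat, x.location_raw_lon, x.lat_lead, x.lon_lead),
                          meta=(None, 'float64'), axis=1)
new_df = ddf.compute(scheduler='threads', num_workers=32)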

What about Spark? I tried to use PySpark in local mode, but performance was even worse (1 request per second) with this simple configuration:

from pyspark.sql import SparkSession

spark = SparkSession \
  .builder \
  .appName("Test") \
  .config("spark.sql.warehouse.dir", "C:\\temp\\hive") \
  .config('spark.sql.session.timeZone', 'UTC') \
  .getOrCreate()

I am aware that something could be missing, but I don't know whether it makes sense to specify executors and their memory in local mode (I am actually pretty new to Spark, so it's possible that I have misunderstood something).
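To make it concrete, this is the kind of tuning I mean (just a sketch; I don't know whether these settings actually matter when everything runs inside a single driver JVM in local mode):

from pyspark.sql import SparkSession

# Sketch only: local[16] asks for 16 worker threads in the single local JVM,
# and spark.driver.memory sizes that JVM. As far as I understand, executor
# settings would be ignored in local mode.
spark = SparkSession \
  .builder \
  .appName("Test") \
  .master("local[16]") \
  .config("spark.driver.memory", "4g") \
  .getOrCreate()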

More in general, the goal is to add a column with the result of a generic REST API call (it could be the Google APIs or whatever). How can I submit multiple requests for big dataframes so that I can fulfil this task? Any help with this approach, or a suggestion for an alternative one?
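For example, one alternative I have been wondering about is taking the request part out of Dask/Spark entirely and using asyncio with aiohttp (aiohttp is just an assumption on my part, I have not tested this), then writing the results back into the dataframe. A rough sketch of what I have in mind:

import asyncio
import aiohttp
import pandas as pd

async def fetch_heading(session, sem, row):
    # Sketch only: same GraphHopper call as above, but non-blocking, with a
    # semaphore capping the number of in-flight requests.
    url = (f"http://localhost:8989/route?point={row.location_raw_lat},{row.location_raw_lon}"
           f"&point={row.lat_lead},{row.lon_lead}")
    async with sem:
        async with session.get(url) as res:
            data = await res.json()
            return data['paths'][0]['heading']

async def fetch_all(df, max_in_flight=100):
    sem = asyncio.Semaphore(max_in_flight)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_heading(session, sem, row) for row in df.itertuples()]
        return await asyncio.gather(*tasks)

df = pd.read_csv('test.csv')
df['result'] = asyncio.run(fetch_all(df))

Thanks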

  • You will be able to control number of requests more easily, if you separate your data retrieval steps with your data partitioning/processing steps. To increase the number of requests for data retrieval, you can use an approach like this https://stackoverflow.com/questions/43448042/parallel-post-requests-using-multiprocessing-and-requests-in-python – cylim Nov 18 '19 at 01:21
  • I tried your approach, but I am still getting at most 4 requests per second. I can't deal with this amount of data with these timings. Any other solution? – marks Nov 18 '19 at 13:19

0 Answers