I am facing an issue when submitting parallel/multiple requests per second from a Pandas dataframe to a local web server (GraphHopper, used for retrieving route information given two points as coordinates).
My dataframe (~1 million rows) has this schema (just to give an idea):
id lat1 lon1 lat2 lon2
I studied a bit and found the Dask Python library, which seemed to do what I was looking for. This is a reduced part of the code I am using:
import requests
import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count

def graphhopper(lat1, lon1, lat2, lon2):
    # Query the local GraphHopper instance for a route between the two points
    res = requests.get(f"http://localhost:8989/route?point={lat1},{lon1}&point={lat2},{lon2}")
    data = res.json()
    return data['paths'][0]['heading']

df = pd.read_csv('test.csv')
nCores = cpu_count()
ddf = dd.from_pandas(df, npartitions=nCores)
ddf['result'] = ddf.apply(lambda x: graphhopper(x.lat1, x.lon1, x.lat2, x.lon2),
                          meta=(None, 'float64'), axis=1)
new_df = ddf.compute(scheduler='threads')
This actually manages to make 4 parallel requests per second (my machine has 4 cores), which looked reasonable.
What I don't understand is: why, when I partition the starting dataframe into more chunks (e.g. dd.from_pandas(df, npartitions=100)), am I not able to parallelize further, and it still submits only 4 requests per second? Of course, this isn't acceptable, because I need to handle 1 million requests and it would take far too long. How can I increase the number of requests per second?
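Concretely, this is the higher-partition variant I tried (same imports, df and graphhopper as in the snippet above), which still tops out at 4 requests per second:

# Only the partitioning changes compared to the snippet above
ddf = dd.from_pandas(df, npartitions=100)
ddf['result'] = ddf.apply(lambda x: graphhopper(x.lat1, x.lon1, x.lat2, x.lon2),
                          meta=(None, 'float64'), axis=1)
new_df = ddf.compute(scheduler='threads')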
What about Spark? I tried to use PySpark in local mode, but performance was even worse (1 request per second) with this simple configuration:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Test") \
    .config("spark.sql.warehouse.dir", "C:\\temp\\hive") \
    .config("spark.sql.session.timeZone", "UTC") \
    .getOrCreate()
I am aware that something could be missing, but I don't know whether it makes sense to specify executors and their memory in local mode (I am actually pretty new to Spark, so it's possible I misunderstood something).
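For example, something like the following is the kind of local-mode tuning I am unsure about (the "local[8]" master string and the memory value are just guesses on my part):

from pyspark.sql import SparkSession

# Guessed local-mode settings: "local[8]" sets the number of local worker
# threads, and spark.driver.memory is, as far as I understand, the only
# memory knob that matters when everything runs in a single local JVM.
spark = (
    SparkSession.builder
    .appName("Test")
    .master("local[8]")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.warehouse.dir", "C:\\temp\\hive")
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)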
More in general, the goal is to add a column with the result of a generic REST API call (it could be the Google APIs or whatever). How can I submit multiple requests for a big dataframe so that I can fulfil the task? Any help with this approach, or a suggestion of an alternative one?
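For instance, would a plain thread pool along these lines be a more sensible starting point than Dask or Spark for an I/O-bound workload like this? (Just a sketch: generic_api_call is a placeholder for whatever REST endpoint I end up using, and the worker count is arbitrary.)

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

def generic_api_call(row):
    # Placeholder REST call; the real one could be GraphHopper, Google APIs, etc.
    res = requests.get(
        "http://localhost:8989/route",
        params={"point": [f"{row.lat1},{row.lon1}", f"{row.lat2},{row.lon2}"]},
    )
    return res.json()['paths'][0]['heading']

df = pd.read_csv('test.csv')

# Many threads, since each request mostly waits on the server
with ThreadPoolExecutor(max_workers=50) as pool:
    df['result'] = list(pool.map(generic_api_call, df.itertuples(index=False)))

Thanks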