I have a text file containing several million URLs, and I need to send a POST request to each of them. Running this on my own machine takes far too long, so I would like to use my Spark cluster instead.
I wrote this PySpark code:
from pyspark.sql.types import StringType
import requests
url = ["http://myurltoping.com"]
list_urls = url * 1000 # The final code will just import my text file
list_urls_df = spark.createDataFrame(list_urls, StringType())
print('number of partitions: {}'.format(list_urls_df.rdd.getNumPartitions()))
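def execute_requests(list_of_url):
    # Called once per partition; list_of_url is an iterator of Row objects.
    final_iterator = []
    for url in list_of_url:
        # Blocking call: each request waits for the previous one to finish.
        r = requests.post(url.value)
        final_iterator.append((r.status_code, r.text))
    return iter(final_iterator)

# mapPartitions is lazy: no request is sent until an action runs on the result.
processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)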
However, it still takes a long time. How can I make the function execute_requests more efficient, for example by launching the requests in each partition asynchronously?
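For concreteness, this is roughly the kind of change I have in mind: a minimal sketch that sends the POSTs of one partition from a thread pool instead of one after another. It is an assumption on my part, not something I have validated on the cluster. The helper post_url, the post and max_workers parameters, the 10-second timeout and the pool size of 32 are all placeholders I made up for illustration, and it assumes Python 3 (or the futures backport on Python 2).

```python
from concurrent.futures import ThreadPoolExecutor


def post_url(url):
    """POST to one URL and return (status_code, body).

    Hypothetical helper: network errors are returned as (None, message)
    instead of being raised, so one bad URL does not fail the whole task.
    """
    import requests  # imported here so it is resolved on the executor

    try:
        r = requests.post(url, timeout=10)  # timeout value is an assumption
        return (r.status_code, r.text)
    except requests.RequestException as exc:
        return (None, str(exc))


def execute_requests(rows, post=post_url, max_workers=32):
    """mapPartitions function: send one partition's POSTs from a thread pool.

    rows is the partition iterator of Row objects with a `value` field.
    post and max_workers are illustrative parameters, not part of any API.
    """
    urls = [row.value for row in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order; list() waits for all of them.
        results = list(pool.map(post, urls))
    return iter(results)


# Intended use on the cluster (not executed here):
# processed_urls_df = list_urls_df.rdd.mapPartitions(execute_requests)
```

It would be passed to mapPartitions exactly like my current function. Like the current version, it still keeps all results of a partition in memory before returning them.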
Thanks!