Greetings everyone,
I'm writing a Python program that needs to make 1000+ HTTP requests (via the requests library). It builds a request, fetches some JSON,
and then processes it, and does this for over 1000 URLs. If I do it conventionally, one request at a time, it takes some 20 minutes, but it needs to be done within 3 minutes.
So after a few hours of research, the most efficient solution I found was multithreading combined with keep-alive TCP connections.
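For context, the general pattern I was going for looks roughly like this: a minimal sketch using concurrent.futures with one shared Session (the URL, headers, and JSON field here are placeholders, not my real ones):

```python
import concurrent.futures

import requests

# one shared Session so urllib3 can pool and reuse TCP connections
s = requests.Session()

def get_info(product_name):
    # placeholder URL and JSON field; the real ones differ
    url = "https://www.example.com/products/" + product_name
    r = s.get(url, headers={"Accept": "application/json"})
    return r.json().get("information")

def fetch_all(products, max_workers=20):
    # threads overlap the network waits, while the Session's connection
    # pool keeps the underlying TCP connections alive between requests
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(get_info, products))
```

One thing I noticed in the requests docs: the adapter's connection pool defaults to a maximum of 10 connections per host, so with more worker threads than that it may still open extra connections.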
So basically I'm trying to retrieve some information about a few products from a website, through web scraping.
The program below illustrates the same:
import requests
import json
import threading
import time

s = requests.Session()
headers = {"Accept": "application/json"}  # placeholder; the real headers differ

def getInfo(productName):
    # this try block tries to get information and then parse it and then display a few
    # parameters about the particular product...
    try:
        # this is just an example URL...
        URL = "https://www.example.com/products/" + productName
        r = s.get(URL, headers=headers)
        result = json.loads(r.text)
        print(result['information'])
    except json.decoder.JSONDecodeError:
        print("Unable to process data for " + productName)

products = [product1, product2, product3... productN]
counter = 1
mainThread = threading.current_thread()

for product in products:
    # this if block checks if this is the fifth iteration of the for loop...
    # if yes then change tcp connection...
    if counter % 5 == 0:
        # wait until all other threads except the main thread are completed, cause we don't
        # want to drop the connection in the middle of a request...
        for thread in threading.enumerate():
            if thread is mainThread:
                continue
            thread.join()
        print("Connection Switched")
        # establish a new connection...
        s = requests.Session()
        # hold on for a sec
        time.sleep(1)
    # start a new thread for getting info about the current 'product'
    thread = threading.Thread(target=getInfo, args=(product,))
    thread.start()
    counter += 1

print("Done")
Note that the code above is a simplified version of the real code.
Anyway, for some reason my program creates a new TCP connection for each product. There isn't much in the logs, just the basic stuff:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (4): www.example.com
...
Even after hours of searching, I can't seem to find a suitable solution.
Here are some of the things that I have tried:
The above two links are just related questions, so basically I haven't tried any sensible solution yet.
It would really be appreciated if you could help me with this weird snag.
Thanks in advance!