
Greetings everyone, I'm writing a Python program that has to make 1000+ HTTP requests. For each of the 1000+ URLs it fetches some JSON and then processes it. Done the conventional, sequential way it takes some 20 minutes, but it needs to be done within 3 minutes.

So after a few hours of research, the most efficient solution I found was multithreading combined with keep-alive TCP connections.

So basically I'm trying to retrieve some information about a few products from a website through web scraping.

The program below illustrates this:

import requests
import json
import time
import threading

s = requests.Session()
headers = {"User-Agent": "my-scraper"}  # placeholder headers

def getInfo(productName):
    # this try block tries to get information, then parse it and display a few
    # parameters about the particular product...
    try:
        # this is just an example URL...
        URL = "https://www.example.com/products/" + productName
        r = s.get(URL, headers=headers)
        result = json.loads(r.text)
        print(result['information'])
    except json.JSONDecodeError:
        print("Unable to process data for " + productName)

products = ["product1", "product2", "product3", ..., "productN"]  # placeholder list
counter = 1
mainThread = threading.current_thread()

for product in products:
    # this if block checks if this is the fifth iteration of the for loop...
    # if yes then change tcp connection...
    if counter % 5 == 0:
        # wait until all other threads except the main thread are completed, cause we don't
        # want to drop the connection in the middle of a request...
        for thread in threading.enumerate():
            if thread is mainThread:
                continue
            thread.join()

        print("Connection Switched")
        # establish a new connection...
        s = requests.Session()
        # hold on for a sec
        time.sleep(1)

    # start a new thread for getting info about the current 'product'
    thread = threading.Thread(target=getInfo, args=(product,))
    thread.start()
    counter += 1

# wait for the last batch of worker threads to finish before reporting done...
for thread in threading.enumerate():
    if thread is mainThread:
        continue
    thread.join()

print("Done")

Note that the code above is a simplified version of the real code...

Anyways...

For some reason, my program creates a new TCP connection for each product. There isn't much in the logs, just the basic stuff...

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (2): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (3): www.example.com
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (4): www.example.com
...
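(For reference, those DEBUG lines come from urllib3's connectionpool logger; a minimal sketch, assuming the standard logging module, of how I surface them:)

import logging

# root-level DEBUG logging makes urllib3.connectionpool's messages visible,
# in the "DEBUG:urllib3.connectionpool:..." format shown above
logging.basicConfig(level=logging.DEBUG)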

Even after hours of searching, I can't seem to find a suitable solution.

Here are some of the things that I have tried:

I found two related questions, but they were just related questions, so basically I haven't tried any sensible solution yet.

It would really be appreciated if you could help me with this weird snag...

Thanks in advance

HufF867

1 Answer


I'm new to this library and just encountered this issue.

I solved it like this:

import requests

with requests.Session() as session:
    response = session.request("POST", my_url, data=my_data)

This makes everything much faster as it reuses the connection you make at the beginning.
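And if you need threads like in your question, here's a minimal sketch of sharing one Session across a thread pool (the fetch helper, the example URL, and the worker counts are my placeholders, not your real code). Mounting an HTTPAdapter widens urllib3's per-host pool, which defaults to 10 connections, so the workers can reuse sockets instead of opening new ones:

import requests
from concurrent.futures import ThreadPoolExecutor
from requests.adapters import HTTPAdapter

session = requests.Session()
# widen the per-host connection pool (default is 10) to match the worker count
session.mount("https://", HTTPAdapter(pool_connections=20, pool_maxsize=20))

def fetch(product_name):
    # hypothetical endpoint, mirroring the question's example URL
    r = session.get("https://www.example.com/products/" + product_name, timeout=10)
    r.raise_for_status()
    return r.json()

with ThreadPoolExecutor(max_workers=20) as pool:
    for result in pool.map(fetch, ["product1", "product2"]):
        print(result)

Matching max_workers to pool_maxsize matters here: when the pool runs dry, urllib3 doesn't block by default, it just opens another connection, which is what those "Starting new HTTPS connection" lines indicate.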

GarethD