
I'm currently working on a scraper and trying to figure out how I can assign proxies that are available to use, meaning that if I use 5 threads and thread-1 uses proxy A, no other thread should be able to access proxy A; each thread should instead pick randomly from the remaining available proxy pool.

import random
import time
from threading import Thread

import requests

list_op_proxy = [
    "http://test.io:12345",
    "http://test.io:123456",
    "http://test.io:1234567",
    "http://test.io:12345678"
]

session = requests.Session()


def handler(name):
    while True:
        try:
            session.proxies = {
                # entries in list_op_proxy already include the http:// scheme
                'https': random.choice(list_op_proxy)
            }
            with session.get("https://stackoverflow.com"):
                print(f"{name} - Yay request made!")

            time.sleep(random.randint(5, 10))
        except requests.exceptions.RequestException as err:
            print(f"Error! Let's try again! {err}")
            continue

        except Exception as err:
            print(f"Error! Let's debug! {err}")
            raise


for i in range(5):
    Thread(target=handler, args=(f'Thread {i}',)).start()

I wonder how I can make each thread use only a proxy that is currently available, "block" that proxy so no other thread can use it, and release it once the request is finished?

PythonNewbie

1 Answer

One way to go about this would be to use a global shared list that holds the currently active proxies, or to remove a proxy from the list and re-add it after the request is finished. You do not have to worry about concurrent access to the list, since CPython's GIL makes individual list operations such as append and remove effectively atomic.

proxy = random.choice(list_op_proxy)
list_op_proxy.remove(proxy)
session.proxies = {
    'https': proxy  # the entry already includes the http:// scheme
}
# ... do request

list_op_proxy.append(proxy)
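
For completeness, here is a minimal runnable sketch of this list approach dropped into the question's handler. The threading.Lock around the pick-and-remove step (also mentioned in the comments below) and the per-thread Session are my own additions, not part of the snippet above; they avoid two threads grabbing the same proxy between the choice and the remove, and avoid threads overwriting each other's session.proxies.

import random
import time
from threading import Lock, Thread

import requests

list_op_proxy = [
    "http://test.io:12345",
    "http://test.io:123456",
    "http://test.io:1234567",
    "http://test.io:12345678"
]
proxy_lock = Lock()  # guards the pick-and-remove step


def handler(name):
    session = requests.Session()  # per-thread session (assumption, see above)
    while True:
        proxy = None
        while proxy is None:
            with proxy_lock:
                if list_op_proxy:
                    proxy = random.choice(list_op_proxy)
                    list_op_proxy.remove(proxy)  # reserve the proxy
            if proxy is None:
                time.sleep(0.1)  # all proxies are busy, wait and retry
        session.proxies = {'https': proxy}
        try:
            with session.get("https://stackoverflow.com"):
                print(f"{name} - Yay request made via {proxy}!")
        except requests.exceptions.RequestException as err:
            print(f"{name} - request failed: {err}")
        finally:
            with proxy_lock:
                list_op_proxy.append(proxy)  # release the proxy
        time.sleep(random.randint(5, 10))


for i in range(5):
    Thread(target=handler, args=(f'Thread {i}',)).start()

With four proxies and five threads, one thread will always be waiting in the inner loop until another thread releases a proxy.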

You could also do this with a queue, just popping and re-adding proxies, to make it more efficient.

Using a Proxy Queue

Another option is to put the proxies into a queue and get() a proxy before each request, removing it from the available proxies, and then put() it back after the request has finished. This is a more efficient version of the list approach mentioned above.

First we need to initialize the proxy queue.


import queue

proxy_q = queue.Queue()
for proxy in proxies:  # e.g. list_op_proxy from the question
    proxy_q.put(proxy)

Within the handler we then get a proxy from the queue before performing a request. We perform the request and put the proxy back into the queue.
We are using block=True, so that the queue blocks the thread if there is no proxy currently available. Otherwise the thread would terminate with a queue.Empty exception once all proxies are in use and a new one should be acquired.

def handler(name):
    global proxy_q
    while True:
        proxy = proxy_q.get(block=True) # we want blocking behaviour
        # ... do request
        proxy_q.put(proxy)
        # ... response handling can be done after proxy put to not
        # block it longer than required
        # do not forget to define a break condition
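
Putting the queue pieces together, a minimal runnable sketch (reusing the placeholder proxies and URL from the question) might look like the following; the per-thread Session and the try/finally block are my additions so that the proxy is always put back into the queue, even if the request fails.

import queue
import random
import time
from threading import Thread

import requests

list_op_proxy = [
    "http://test.io:12345",
    "http://test.io:123456",
    "http://test.io:1234567",
    "http://test.io:12345678"
]

proxy_q = queue.Queue()
for proxy in list_op_proxy:
    proxy_q.put(proxy)


def handler(name):
    session = requests.Session()  # per-thread session (assumption)
    while True:
        proxy = proxy_q.get(block=True)  # blocks until a proxy is free
        session.proxies = {'https': proxy}
        try:
            with session.get("https://stackoverflow.com"):
                print(f"{name} - Yay request made via {proxy}!")
        except requests.exceptions.RequestException as err:
            print(f"{name} - request failed: {err}")
        finally:
            proxy_q.put(proxy)  # release the proxy for other threads
        time.sleep(random.randint(5, 10))


for i in range(5):
    Thread(target=handler, args=(f'Thread {i}',)).start()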

Using Queue and Multiprocessing

First you would initialize the manager and put all your data into the queue and initialize another structure for collecting your results (here we initialize a shared list).

import multiprocessing

manager = multiprocessing.Manager()
q = manager.Queue()
for e in entities:  # entities = the request URIs / items you want to scrape
    q.put(e)
print(q.qsize())
results = manager.list()

Then you initialize the scraping processes:

processes = []
for proxy in proxies:
    processes.append(multiprocessing.Process(
        target=scrape_function,
        args=(q, results, proxy),
        daemon=True))

And then start each of them:

for w in processes:
    w.start()

Lastly, you join every process to ensure that the main process does not terminate before the subprocesses are finished:

for w in processes:
    w.join()

Inside the scrape_function you then simply get one item at a time and perform the request. In its default configuration the queue object raises a queue.Empty error when it is empty, so we use an infinite while loop and break out of it by catching that exception.

def scrape_function(q, results, proxy):
    session = requests.Session()
    session.proxies = {
        'https': f'http://{proxy}'
    }
    while True:
        try:
            request_uri = q.get(block=False)
            with session.get(request_uri) as resp:
                print(f"{proxy} - Yay request made!")
                results.append(resp.text)  # or whatever part of the response you need
            time.sleep(random.randint(5, 10))
        except queue.Empty:
            break

The results of each query are appended to the results list, which is also shared among the different processes.
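
For reference, a hedged end-to-end sketch of the multiprocessing variant could look like this; the proxy and URL values are placeholders, and the if __name__ == "__main__": guard is needed on platforms that spawn new processes (e.g. Windows) so that the child processes do not re-execute the setup code.

import multiprocessing
import queue
import random
import time

import requests

proxies = ["test.io:12345", "test.io:123456"]     # hostname:port placeholders
entities = ["https://stackoverflow.com"] * 10     # dummy request URIs


def scrape_function(q, results, proxy):
    session = requests.Session()
    session.proxies = {'https': f'http://{proxy}'}
    while True:
        try:
            request_uri = q.get(block=False)
            with session.get(request_uri) as resp:
                results.append(resp.status_code)  # keep whatever part of the response you need
            time.sleep(random.randint(5, 10))
        except queue.Empty:
            break
        except requests.exceptions.RequestException as err:
            print(f"{proxy} - request failed: {err}")


if __name__ == "__main__":
    manager = multiprocessing.Manager()
    q = manager.Queue()
    for e in entities:
        q.put(e)
    results = manager.list()

    processes = []
    for proxy in proxies:
        processes.append(multiprocessing.Process(
            target=scrape_function,
            args=(q, results, proxy),
            daemon=True))

    for w in processes:
        w.start()
    for w in processes:
        w.join()

    print(list(results))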

  • Queue would sound awesome I believe, but I have no idea how I can apply it to this scenario unfortunately, maybe you know? :( I don't know what I think about appending/removing. I was thinking maybe to do like an "available/busy" status for each proxy, but there is a chance that two threads use the same proxy if they grab it at the same time. I think a queue would be very great, but how? – PythonNewbie Jul 04 '21 at 12:44
  • As mentioned above, the GIL prevents concurrent access on a list in CPython (more information [here](https://stackoverflow.com/a/6319267/8896833)). Since `append` and `remove` should be atomic to the best of my knowledge you should have no problems. If you really want to be sure you still can use a [`Lock`](https://docs.python.org/3/library/threading.html#lock-objects). This basically is the same as holding a `busy` or `available` status. – Vincent Scharf Jul 04 '21 at 12:52
  • Oh really? But isn't a queue then also a better suggestion, where you just pull the data from the queue, which means two threads will never happen to have the same proxy? – PythonNewbie Jul 04 '21 at 13:00
  • Yeah, that is actually what I was about to suggest. You can just assign each thread to one proxy and then put the data in a queue you are reading from. This also works with [`processes`](https://docs.python.org/3/library/multiprocessing.html). If you want "real" parallelism in python you have to use multiprocessing instead of multithreading. To create a shared queue between subprocesses you could then use a [`manager`](https://docs.python.org/3/library/multiprocessing.html#managers) and create the queue via it. – Vincent Scharf Jul 04 '21 at 13:05
  • Sounds a bit out of my knowledge, is there a small chance that you might be able to show me an example of how it can look with queues? – PythonNewbie Jul 04 '21 at 13:06
  • Before you go on! I will unfortunately need to use threading for a different purpose, so the multiprocessing might not need to be written here unless you want! – PythonNewbie Jul 04 '21 at 13:10
  • I have added an example. The same principles also apply with threading, just that you do not have to create the queue via the manager; instead you can just create it directly. The rest of the operations are pretty much the same. – Vincent Scharf Jul 04 '21 at 13:25
  • Oh wow! That looks awesome! But don't you need to do q.put(..) in the scraper? Or is it because we add it into the results list instead? – PythonNewbie Jul 04 '21 at 13:26
  • I assumed that each process/thread uses one proxy exclusively now. The queue contains your request uris (or any other data that you need to perform the requests), the results list is there to collect the results of your queries. As I said, with threads you do not need to use a manager, but can instead just use the usual object. – Vincent Scharf Jul 04 '21 at 13:29
  • Yeah, now it's a lot easier to understand. I was thinking, after your update, that I could use a queue directly for the proxies instead and do q.get to get a proxy and q.put when it's finished? – PythonNewbie Jul 04 '21 at 13:32
  • Yeah, sure, that is basically what we discussed before using the `list`. I will add another section to my answer using a proxy queue. – Vincent Scharf Jul 04 '21 at 13:33
  • That would be awesome! I will of course set your answer as the answer! Very well said too! – PythonNewbie Jul 04 '21 at 13:36
  • Thank you very much, the edits are added :) – Vincent Scharf Jul 04 '21 at 13:44
  • Glad for your examples! Very well done! Legend – PythonNewbie Jul 04 '21 at 13:48