
I'm trying to scrape a website, but I want to also be cautious and not spam their website with the same IP.

I have a list of residential proxies, over 100k. The site is very strict and will throw a captcha if I make requests from the same IP 5 times. I have proxy-rotation logic in place, and I have a decent PC, so I thought of using

executor = concurrent.futures.ThreadPoolExecutor(
        max_workers=50
)

That gives me 50 workers (which I assume is CPU intensive). I want to be able to make 50 unique requests, each through its own proxy, with some sort of delay between them. As of now my code isn't really scalable because I can't figure out how to add that delay.
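Something like this is what I had in mind (just a sketch, not tested; scrape_all is a made-up wrapper around the scraper function shown further down): space out the submissions to the pool so the requests don't all fire at the same instant, while up to 50 can still be in flight at once.

import concurrent.futures
import random
import time

def scrape_all(items, delay=0.5):
    # Sketch only: jittered gap between submissions; up to 50 requests can
    # still be running at once, they just don't all start together.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=50)
    futures = []
    for item in items:
        futures.append(executor.submit(scraper, item))
        time.sleep(delay + random.uniform(0, 0.3))
    return [f.result() for f in concurrent.futures.as_completed(futures)]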

My current code:

import asyncio
import concurrent.futures
import json
import logging
import sys
import time

def triggerPipeline():
    fileData = json.loads(open('usernames.json').read())  # This consists of 19k names.
    start_time = time.time()
    logging.basicConfig(
        level=logging.INFO,
        format='%(threadName)10s %(name)18s: %(message)s',
        stream=sys.stderr,
    )
    # Create a limited thread pool.
    executor = concurrent.futures.ThreadPoolExecutor(
        max_workers=50,
    )
    event_loop = asyncio.get_event_loop()
    try:
        event_loop.run_until_complete(
            run_blocking_tasks(executor, fileData)
        )
    finally:
        event_loop.close()
    duration = time.time() - start_time

async def run_blocking_tasks(executor, fileData):
    log = logging.getLogger('run_blocking_tasks')
    log.info('starting')
    log.info('creating executor tasks')
    start_time = time.time()
    loop = asyncio.get_event_loop()
    blocking_tasks = [
        loop.run_in_executor(executor, scraper, i)
        for i in fileData
    ]
    log.info('waiting for executor tasks')
    completed, pending = await asyncio.wait(blocking_tasks)
    results = [t.result() for t in completed]
    duration = time.time() - start_time
    log.info('exiting')
    log.info('Printing Status....')
    printStats(fileData)

I have the function triggerPipeline which basically spawns 50 workers, then I call

event_loop.run_until_complete(
    run_blocking_tasks(executor, fileData)
)

run_blocking_tasks basically awaits all of the executor tasks, and inside those tasks I do the scraping job where I make the requests etc.

I want to add some sort of await (or delay) within the scraper function.

def scraper(data):
    log = logging.getLogger('scrapePageNative')
    try:
        userName = data['petName']
        age = data.get('ageRange')  # assuming the age range comes from the same record
        content = makeApiRequest(name=userName, ageRange=age, page=None)
        return content
    except Exception as e:
        log.info(f'Found Exception! {e}')
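
What I had in mind is something like this (just a sketch): since each call runs in its own worker thread, a plain time.sleep() there only blocks that one thread, and a hypothetical wrapper could be submitted to the executor instead of scraper.

import random
import time

def scraper_with_delay(data):
    # Sketch: pause before scraping; this blocks only the current worker
    # thread, the other pool workers keep running. Jitter avoids a fixed cadence.
    time.sleep(random.uniform(1.0, 3.0))
    return scraper(data)

run_in_executor would then be given scraper_with_delay instead of scraper.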

Ideally, I am looking for a scalable way - I don't mind the scraper being slower - but I do have these proxies, which I am paying for, and I would like to make use of them. makeApiRequest basically gets a new proxy and then makes a request to the source I am trying to scrape data from. As of now it throws a lot of errors, and I assume that is because the server I am trying to scrape is treating it as some sort of DDoS attack; I want it to look like 40 different people are each in their own session.
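
On the proxy side, this is the kind of rotation I mean (sketch only; ProxyRotator is a made-up helper, not my actual makeApiRequest): with 100k+ proxies and only ~19k names, a thread-safe round-robin already keeps every IP far under the 5-requests-per-IP captcha limit.

import itertools
import threading

class ProxyRotator:
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        # next() on a shared iterator isn't guaranteed to be thread-safe,
        # so guard it with a lock since 50 workers call this concurrently.
        self._lock = threading.Lock()

    def next_proxy(self):
        with self._lock:
            return next(self._cycle)

makeApiRequest would then call next_proxy() for every request rather than reusing whatever IP it last had.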

    *"I'm trying to scrape a website, but I want to also be cautious and not spam their website with the same IP."* - Then you **don't** want the fastest way to do this. Because the faster you go, the more likely that your requests will be viewed as a DOS attack. – Stephen C Nov 08 '20 at 06:53
  • But the flipside is that you could go out and purchase a large number of IP addresses, set up servers on each one and .... – Stephen C Nov 08 '20 at 06:56
  • But seriously. If you are scraping and running into rate limiting, then there is a good chance that what you are doing is violating the Terms and Conditions of the site you are scraping. There are two issues with that: 1) they could come after you with more severe sanctions than rate limiting (e.g. IP blocking or lawsuits!), 2) the morality of leeching off someone else's free service to build a business is highly questionable. – Stephen C Nov 08 '20 at 07:00
  • Why not just use asyncio-based libraries instead of multithreading? A small [example](https://stackoverflow.com/a/63881674/13782669) – alex_noname Nov 08 '20 at 13:23
  • @StephenC So what's a good max_workers limit? I'm in a similar situation (without the DDOS aspect of OP), where I need to make many GET requests to a website. – user3932000 Dec 28 '20 at 19:07
  • There is no simple answer. You need to tune it. And you should also do this in consultation with the people who run the server, so that they can assess the impact on server performance (for you and for other users) of your crawling. – Stephen C Dec 29 '20 at 03:07
  • (You may think that you are not in a DDOS situation ... but doing this incorrectly has the potential to DOS yourself. While it is not illegal to DOS "yourself", etc, you are still liable to disrupt the service, annoy the server admins, other users, etc.) – Stephen C Dec 29 '20 at 03:11

0 Answers