I'm trying to scrape a website, but I want to be cautious and not spam it with requests from the same IP.
I have a list of residential proxies, over 100k of them. The site is very strict and will throw a captcha if I make 5 requests from the same IP. I have proxy-rotation logic in place, and I have a decent PC, so I thought of using
executor = concurrent.futures.ThreadPoolExecutor(
    max_workers=50
)
where I have 50 workers (which I assume is CPU intensive). I want to be able to make 50 unique requests, each through its own proxy, with some sort of delay between those requests. As of now my code really doesn't scale, because I am not able to figure out how to add that delay.
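To make the 5-requests-per-IP limit concrete, this is roughly what I mean by rotation. It is only a simplified sketch; ProxyRotator and max_uses are made-up names for illustration, not my actual code:

import collections
import threading

class ProxyRotator:
    """Hand out proxies round-robin and retire each one after max_uses
    requests, so no single IP reaches the site's 5-request captcha limit."""

    def __init__(self, proxies, max_uses=4):
        self._queue = collections.deque(proxies)
        self._uses = collections.Counter()
        self._max_uses = max_uses
        self._lock = threading.Lock()  # get() is called from many worker threads

    def get(self):
        with self._lock:
            if not self._queue:
                raise RuntimeError('every proxy has hit its use limit')
            proxy = self._queue.popleft()
            self._uses[proxy] += 1
            if self._uses[proxy] < self._max_uses:
                self._queue.append(proxy)  # still under the cap, recycle it
            return proxy

With 100k+ proxies capped at 4 uses each, that is far more capacity than the 19k names need.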
My current code:
import asyncio
import concurrent.futures
import json
import logging
import sys
import time

def triggerPipeline():
    fileData = json.loads(open('usernames.json').read())  # this consists of 19k names
    start_time = time.time()
    logging.basicConfig(
        level=logging.INFO,
        format='%(threadName)10s %(name)18s: %(message)s',
        stream=sys.stderr,
    )
    # Create a limited thread pool.
    executor = concurrent.futures.ThreadPoolExecutor(
        max_workers=50,
    )
    event_loop = asyncio.get_event_loop()
    try:
        event_loop.run_until_complete(
            run_blocking_tasks(executor, fileData)
        )
    finally:
        event_loop.close()
    duration = time.time() - start_time
async def run_blocking_tasks(executor, fileData):
    log = logging.getLogger('run_blocking_tasks')
    log.info('starting')
    log.info('creating executor tasks')
    start_time = time.time()
    loop = asyncio.get_event_loop()
    blocking_tasks = [
        loop.run_in_executor(executor, scraper, i)
        for i in fileData
    ]
    log.info('waiting for executor tasks')
    completed, pending = await asyncio.wait(blocking_tasks)
    results = [t.result() for t in completed]
    duration = time.time() - start_time
    log.info('exiting')
    log.info('Printing Status....')
    printStats(fileData)
I have the function triggerPipeline, which basically spawns the 50 workers, and then I call
event_loop.run_until_complete(
    run_blocking_tasks(executor, fileData)
)
run_blocking_tasks is basically the coroutine that gets awaited, and within it I farm the scraping job out to the executor, which is where the requests are made. I want to add some sort of delay (an await or sleep) inside the scraper function, which currently looks like this (a sketch of what I have in mind follows it):
def scraper(data):
    log = logging.getLogger('scrapePageNative')
    try:
        userName = data['petName']
        content = makeApiRequest(name=userName, ageRange=age, page=None)
        return content
    except Exception as e:
        log.info(f'Found Exception! {e}')
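The kind of change I am picturing is roughly this. It is only a sketch: the 2-5 second range is an arbitrary guess, and makeApiRequest and age are from my existing code, not defined here. Since scraper runs in a worker thread via run_in_executor, my understanding is that time.sleep there only blocks that one thread, not the event loop:

import logging
import random
import time

def scraper(data):
    log = logging.getLogger('scrapePageNative')
    try:
        userName = data['petName']
        # pause before the request so the 50 workers don't all hit
        # the site at the same instant (range is a placeholder)
        time.sleep(random.uniform(2.0, 5.0))
        content = makeApiRequest(name=userName, ageRange=age, page=None)
        return content
    except Exception as e:
        log.info(f'Found Exception! {e}')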
Ideally, I am looking for a scalable approach. I don't mind the scraper being slower, but I am paying for these proxies and would like to actually make use of them. makeApiRequest basically gets a new proxy and then makes a request to the source I am trying to scrape data from. As of now it throws a lot of errors, and I am assuming that is because the server I am scraping notices the burst of traffic and treats it like some sort of DDoS attack. I want it to look as though roughly 40 people are each browsing in their own session.
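The closest thing I can think of for that "40 people in a session" feel is to cap how many requests are in flight at once and add a small random pause per request. A rough sketch, with all the numbers arbitrary and polite_request/make_request as placeholder names:

import random
import threading
import time

session_slots = threading.Semaphore(40)  # at most ~40 requests in flight at once

def polite_request(make_request, item):
    """Wait for a free 'session' slot, pause a human-ish amount,
    then fire the real request (make_request is a placeholder)."""
    with session_slots:
        time.sleep(random.uniform(1.0, 4.0))
        return make_request(item)

Each of the 50 threads would call something like polite_request instead of hitting makeApiRequest directly, so at most 40 of them are actually talking to the site at any moment and each one pauses before its request. Is that a reasonable direction, or is there a better pattern for this?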