
Is there any way to speed up a web scraper by having multiple computers contribute to processing a list of urls? For example, computer A takes urls 1-500 and computer B takes urls 501-1000, etc. I am looking for a way to build the fastest possible web scraper with the resources available to everyday people.

I am already using multiprocessing from the grequests module, which is gevent + requests combined.

This scraping does not need to run constantly, but at a specific time each morning (6 A.M.), and it should finish as soon as possible after it starts. I am looking for something quick and punctual.

Also, I am looking through urls for retail stores (e.g. Target, Best Buy, Newegg, etc.) and using the scraper to check what items are in stock for the day.
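
Something like this split is what I have in mind; this is just a sketch (the urls_for_machine name is made up), assuming each computer is started knowing its own index and the total machine count:

def urls_for_machine(url_list, machine_index, machine_count):
    # Give each machine a contiguous slice of the url list, e.g. with two
    # machines, machine 0 would take urls 1-500 and machine 1 urls 501-1000.
    chunk = -(-len(url_list) // machine_count)  # ceiling division
    return url_list[machine_index * chunk:(machine_index + 1) * chunk]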

This is a code segment for grabbing those urls in the script I'm trying to put together:

import datetime
import time  # needed for time.sleep() below

import grequests

thread_number = 20
# product_number_list is a list of product numbers, too big for me to include
# the full list. Here are like three:
product_number_list = ['N82E16820232476', 'N82E16820233852', 'N82E16820313777']
nnn = max(int(len(product_number_list) / 100), 1)  # progress-report interval (at least 1)
float_nnn = len(product_number_list) / 100
base_url = 'https://www.newegg.com/Product/Product.aspx?Item={}'
# The lines below create a list of urls, one per product number.
url_list = []
for number in product_number_list:
    url_list.append(base_url.format(number))
results = []
appended_number = 0
for x in range(0, len(url_list), thread_number):
    attempts = 0
    while attempts < 10:
        try:
            rs = (grequests.get(url, stream=False) for url in url_list[x:x + thread_number])
            reqs = grequests.map(rs, stream=False, size=20)
            append = 'yes'
            for i in reqs:
                # grequests.map() returns None for requests that failed outright.
                if i is None or i.status_code != 200:
                    append = 'no'
                    print('Bad Status Code. Nothing Appended.')
                    attempts += 1
                    break
            if append == 'yes':
                appended_number += 1
                results.extend(reqs)
                break
        except Exception:
            print('Something went Wrong. Try Section Failed.')
            attempts += 1
            time.sleep(5)
    if appended_number % nnn == 0:
        now = datetime.datetime.today()
        print(str(int(20 * appended_number / float_nnn)) + '% of the way there at: ' + now.strftime("%I:%M:%S %p"))
    if attempts == 10:
        print('Failed ten times to get urls.')
        time.sleep(3600)
if len(results) != len(url_list):
    print('Results count is off. len(results) == "' + str(len(results)) + '". len(url_list) == "' + str(len(url_list)) + '".')

This is not my code; it is sourced from these two links:

Using grequests to make several thousand get requests to sourceforge, get "Max retries exceeded with url"

Understanding requests versus grequests

  • "Yes" is the answer to your question, although I doubt that's helpful. Have you tried to implement this? Got any code to show (concerning parallelization)? – TemporalWolf Jun 01 '18 at 21:07
  • Yes, it's called a botnet. Avoid making them. Bombarding web services with many calls is not a good idea. – GKFX Jun 01 '18 at 21:07
  • @GKFX I'd say botnets generally imply the other computers are illicitly under your control. Turning your 5 computers on your local network into a "botnet" is not the traditional meaning of the word. – TemporalWolf Jun 01 '18 at 21:10
  • `grequests` is not multiprocessing; it's running everything in a single process, in a single thread, on a single core, using a whole bunch of "threadlets". So, unless you have a non-hyperthreaded single-core processor (which I doubt), you can already speed things up by just using your other cores, without needing to drag in other machines. But unless the bottleneck is your CPU, that won't help. If it's your NIC or your OS, multiple computers will help. But if it's your LAN, or your router, or your upstream connection, even that won't do any good. – abarnert Jun 01 '18 at 21:10
  • What might be a good idea is interspersing calls to different websites (so you contact e.g. Newegg and Bestbuy in two different threads at the same time), as then you avoid scraping any one webservice too intensively. @TemporalWolf I'm exaggerating slightly for comic effect. You are of course right. – GKFX Jun 01 '18 at 21:11
  • @TemporalWolf I have grequests telling Python to get 20 urls at a time. Is that parallelization? Or is there something else involved? – Random Programmer Jun 01 '18 at 21:12
  • It's not parallel as in CPU parallelism. But that doesn't matter unless CPU is your bottleneck. You really have to know where the limits actually are before figuring out how to scale things up. – abarnert Jun 01 '18 at 21:14
  • Meanwhile, if you look at what "web download accelerators" do (or did, back when people used them), they usually have something like a pool of N total requests but no more than M at a time to the same host. Traditional numbers were N=8-16 and M=2-4, but they may be too low nowadays. Anyway, this way, if you're saturating a particular host—or getting rate-limited by them—this avoids slowing down all of your other downloads (which can still be a real problem today). – abarnert Jun 01 '18 at 21:16
  • First you should use multiprocessing and spin up X processes (where X == number of cores)... then each process should spin up Y threads. I would imagine you could run 4 cores with 25 threads each and process 100 urls in *roughly* the time it takes to do one url. – Joran Beasley Jun 01 '18 at 21:16
  • @abarnert I am running an: "Intel(R) Core(TM) i5-6500 CPU @3.20GHz (4 CPUs), ~3.2GHz". I am pretty sure it is a quad-core processor, but I do not think it has hyper-threading. – Random Programmer Jun 01 '18 at 21:16
  • Joran Beasley, how would I do that? – Random Programmer Jun 01 '18 at 21:17
  • @RandomProgrammer The important question is: are you actually blocked on CPU power? When your program is running, use your Activity Monitor or Task Manager or whatever to see your CPU usage. If one core is at 100% and the others are doing nothing, going multiprocessor will help. If one core is at 35% and the others are doing nothing, CPU isn't your problem, so going multiprocessor will not help, and you'll need to look for other ways to scale. (It's even better to look at the CPU usage of your particular program, rather than the system as a whole, but for a quick&dirty check…) – abarnert Jun 01 '18 at 21:19
  • Also, from what it looks like, your script potentially hits a server 200 times as fast as that server can respond... that's a good way to get your IP blacklisted. – TemporalWolf Jun 01 '18 at 21:21
  • I went into CPU usage with resource monitor. It is at ~10% usage with 90% frequency. I looked at all the graphs, and they are all at about 20%. I do not think CPU usage is the problem. Is there a Python module that allows computers to sync up and tackle the same script together? – Random Programmer Jun 01 '18 at 21:23
  • TemporalWolf, I thought of that and am planning on using a proxy. I am doing an average of about 16 requests per second. – Random Programmer Jun 01 '18 at 21:25
  • @RandomProgrammer Which means your proxy will get throttled/blacklisted. It's better to play by the rules in most cases. You also likely don't have to scrape the pages, as many sites have APIs for programs to use (including [target](https://developer.target.com/), [bestbuy](https://developer.bestbuy.com/) and [newegg](https://stackoverflow.com/questions/8265061/newegg-api-access-for-price-inventory-json-xml?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa)). – TemporalWolf Jun 01 '18 at 21:27
  • What if the website does not have an API? – Random Programmer Jun 01 '18 at 21:32
  • Have you tried contacting the website admins to see if they can assist you with your data collection needs? – MxLDevs Jun 01 '18 at 21:40
  • @ThatUmbrellaGuy Uh, I looked it up; the SEC does not offer that. – Random Programmer Jun 01 '18 at 21:43
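
To illustrate the per-host cap abarnert describes in the comments above (a pool of N requests in flight overall, but no more than M at a time to the same host), here is a minimal sketch using threads, per-host semaphores, and plain requests; the numbers and helper names are assumptions for illustration, not a tested recipe:

import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests  # assumed to be installed; plain requests, not grequests

N_TOTAL = 16    # total requests in flight at once
M_PER_HOST = 4  # no more than this many to any single host


def fetch_all(urls):
    # One semaphore per host, created up front so the worker threads only read the dict.
    locks = {urlparse(u).netloc: threading.Semaphore(M_PER_HOST) for u in urls}

    def fetch(url):
        # Cap in-flight requests per host so one slow or rate-limiting site
        # does not monopolize the whole pool.
        with locks[urlparse(url).netloc]:
            return url, requests.get(url, timeout=30).status_code

    with ThreadPoolExecutor(max_workers=N_TOTAL) as pool:
        return list(pool.map(fetch, urls))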
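
And to illustrate Joran Beasley's suggestion of multiprocessing plus threads, here is a minimal sketch using the standard library and plain requests instead of grequests; the 4-process / 25-thread split and the fetch_one / fetch_chunk helpers are illustrative assumptions:

import multiprocessing
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed to be installed


def fetch_one(url):
    # One HTTP GET; return (url, status_code) so the result is small enough
    # to send back between processes.
    response = requests.get(url, timeout=30)
    return url, response.status_code


def fetch_chunk(url_chunk):
    # Each worker process runs 25 threads over its own slice of urls.
    with ThreadPoolExecutor(max_workers=25) as threads:
        return list(threads.map(fetch_one, url_chunk))


def fetch_all(url_list, processes=4):
    # Split the url list into one chunk per process, then fan out.
    chunks = [url_list[i::processes] for i in range(processes)]
    with multiprocessing.Pool(processes) as pool:
        chunked_results = pool.map(fetch_chunk, chunks)
    return [item for chunk in chunked_results for item in chunk]


if __name__ == '__main__':
    urls = ['https://www.newegg.com/Product/Product.aspx?Item=N82E16820232476']  # build the full url_list here
    print(fetch_all(urls))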

0 Answers