
For my bachelor thesis I need to grab some data from about 40,000 websites. I am using Python requests, but at the moment it is really slow at getting a response from the server.

Is there any way to speed it up while keeping my current header setting? All tutorials I found were without a header.

Here is my code snippet:

import requests

def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get(url, headers=headers)

    for line in r.iter_lines():
        ...
QDA
  • How about using `multiprocessing`? http://stackoverflow.com/questions/28393617/python-requests-module-multithreading – bsa Jul 30 '16 at 08:22
  • @bsa no need for multiprocessing; the overhead of a new process is too big, threads are better for I/O-bound work. – Or Duan Jul 30 '16 at 08:25

3 Answers


Well, you can use threads, since this is an I/O-bound problem. Using the built-in threading library is your best choice. I used a Semaphore object to limit how many threads can run at the same time.

import time
import threading

# Allow at most 2 threads to run in parallel
lock = threading.Semaphore(2)


def parse(url):
    """
    Change to your logic, I just use sleep to mock the http request.
    """
    print('getting info', url)
    time.sleep(2)

    # We are done, release the semaphore so the main loop can start another thread
    lock.release()


def parse_pool():
    # List of all your urls
    list_of_urls = ['website1', 'website2', 'website3', 'website4']

    # List of thread objects so we can handle them later
    thread_pool = []

    for url in list_of_urls:
        # Acquire the semaphore first, so we wait here if 2 threads are already running
        lock.acquire()

        # Create a new thread that calls your function with a url
        thread = threading.Thread(target=parse, args=(url,))
        thread_pool.append(thread)
        thread.start()

    for thread in thread_pool:
        thread.join()

    print('done')
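
If you plug the question's request back in, the mocked body of parse can be replaced with the real call, roughly like this (the header dict and the iter_lines loop are taken from the question; the timeout value is just an illustrative addition):

def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0 ...'}  # same header dict as in the question
    try:
        r = requests.get(url, headers=headers, timeout=10)
        for line in r.iter_lines():
            ...  # your parsing logic
    finally:
        # Release even if the request raised, so the pool never stalls
        lock.release()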
Or Duan
  • What if a website appears multiple times in that 40,000-website list (if I need data from more than one page)? How can I avoid DDoSing them? Just waiting a random time before a request? – QDA Jul 30 '16 at 10:18
  • You would have to use a dict: the keys will be the website's URL and the values will be `Semaphore` locks (see the sketch below). That's a bit more complicated. If the answer is right please accept it, thanks :) @QDA – Or Duan Jul 30 '16 at 11:57
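
A minimal sketch of that per-site idea, assuming one request per host at a time is polite enough and that hostnames can be extracted with urllib.parse (the names per_host_locks and fetch are illustrative, not from the answer):

import threading
import requests
from urllib.parse import urlparse

list_of_urls = ['http://example.com/page1', 'http://example.com/page2', 'http://example.org/']
headers = {'User-Agent': 'Mozilla/5.0 ...'}  # same header dict as in the question

# One Semaphore per hostname, built up front so threads never race to create them
per_host_locks = {urlparse(u).netloc: threading.Semaphore(1) for u in list_of_urls}

def fetch(url):
    host = urlparse(url).netloc
    # The with-block acquires the host's lock and releases it even on errors,
    # so each site sees at most one concurrent request from us
    with per_host_locks[host]:
        r = requests.get(url, headers=headers)
        ...  # your parsing logic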

You can use asyncio to run tasks concurrently. You can list the URL responses (the completed ones as well as the pending ones) using the return value of asyncio.wait(), and the coroutines are called asynchronously. The results will come back in an unpredictable order, but it is a faster approach.

import asyncio


async def parse(url):
    print('in parse for url {}'.format(url))

    # Write the logic for fetching the info here; asyncio.sleep() just
    # mocks waiting for the response from the url.
    await asyncio.sleep(1)
    info = 'some data'

    print('done with url {}'.format(url))
    return 'parse {} result from {}'.format(info, url)


async def main(sites):
    print('starting main')
    parses = [
        parse(url)
        for url in sites
    ]
    print('waiting for parses to complete')
    completed, pending = await asyncio.wait(parses)

    results = [t.result() for t in completed]
    print('results: {!r}'.format(results))


event_loop = asyncio.get_event_loop()
try:
    websites = ['site1', 'site2', 'site3']
    event_loop.run_until_complete(main(websites))
finally:
    event_loop.close()
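
Note that requests itself blocks the event loop, so inside an async def you would normally use an asynchronous HTTP client. A minimal sketch with the third-party aiohttp package (an assumption, not part of the answer above) that also keeps the question's header dict:

import asyncio
import aiohttp

headers = {'User-Agent': 'Mozilla/5.0 ...'}  # same header dict as in the question

async def parse(session, url):
    # session.get() awaits the response without blocking the event loop
    async with session.get(url, headers=headers) as resp:
        text = await resp.text()
        return '{} returned {} characters'.format(url, len(text))  # replace with your parsing logic

async def main(sites):
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(parse(session, url) for url in sites))
        print(results)

asyncio.run(main(['http://example.com', 'http://example.org']))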
CfourPiO
  • What if a website appears multiple times in that 40,000-website list (if I need data from more than one page)? How can I avoid DDoSing them? – QDA Jul 30 '16 at 10:18
  • Will using `repeat` from `itertools` help? `from itertools import repeat`; `websites.extend(repeat('sitex', 100))`. What exactly do you need? Are you setting up the pages separately and dynamically? This repeat only helps in appending a particular website a specific number of times. – CfourPiO Jul 30 '16 at 10:58

I think it's a good idea to use multi-threading (the threading module) or multiprocessing, or you can use grequests (asynchronous requests built on gevent).
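
A minimal sketch of the grequests idea, assuming the third-party grequests package is installed (pip install grequests); the size argument caps how many requests run at once, and the header dict from the question can be passed through:

import grequests

headers = {'User-Agent': 'Mozilla/5.0 ...'}  # same header dict as in the question
urls = ['http://example.com', 'http://example.org']

# Build the (not yet sent) requests, then send them concurrently, at most 10 at a time
reqs = (grequests.get(u, headers=headers) for u in urls)
for response in grequests.map(reqs, size=10):
    if response is not None:  # failed requests come back as None by default
        print(response.status_code, response.url)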

kute279