
I am writing a tool which fetches multiple HTML files and processes them as text:

import requests

for url in url_list:
    url_response = requests.get(url)
    text = url_response.text
    # Process text here (put in database, search, etc)

The problem is that this is pretty slow. If I just needed a simple response I could have used grequests, but since I need the content of the HTML file, that doesn't seem to be an option. How can I speed this up?

Thanks in advance!

AdHominem

4 Answers

1

Use a thread for each request:

import threading
import requests

url_list = ["url1", "url2"]

def fetch_url(url):
    url_response = requests.get(url)
    text = url_response.text
    # Process text here; note it is discarded when the thread exits

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
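If you also need the fetched text after the threads have finished, one option (an editorial sketch, not part of the original answer; "url1" and "url2" are placeholders as above) is to give each thread its own slot in a shared list:

import threading
import requests

url_list = ["url1", "url2"]
results = [None] * len(url_list)

def fetch_url(index, url):
    # Each thread writes to a different index, so no lock is needed.
    results[index] = requests.get(url).text

threads = [threading.Thread(target=fetch_url, args=(i, url))
           for i, url in enumerate(url_list)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# results[i] now holds the HTML for url_list[i]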
Assem
  • 11,574
  • 5
  • 59
  • 97
1
import requests
from multiprocessing import Pool

def process_html(url):
    url_response = requests.get(url)
    text = url_response.text
    print(text[:500])
    print('-' * 30)

urls = [
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
]

with Pool(None) as p:  # None => use os.cpu_count() worker processes
    p.map(process_html, urls)  # blocks until all return values from process_html() have been collected
7stud
  • What is the main difference between threading and multiprocessing here and why is this solution better? – AdHominem Nov 26 '15 at 07:26
  • Threads execute within a single process, and execution shifts rapidly between threads. When a thread has to wait--as with http requests--another thread can execute. As a result, a threaded program making http requests will execute faster than a non-threaded program. With multiprocessing and multi core computers, multiple processes can actually execute at the same time on different cores, which can be even faster. – 7stud Nov 26 '15 at 09:30
  • Thanks for your help! `with Pool(5) as p:` produces an `AttributeError: __exit__`. It works, though, when I assign `p = Pool(5)`. Why is that? And more importantly: why can I set an arbitrary number of worker processes even if my CPU can only handle 4 processes at a time? – AdHominem Nov 26 '15 at 12:35
  • Does choosing multiprocessing over multithreading have something to do with the GIL limiting the usefulness of threads in CPython? – AdHominem Nov 26 '15 at 13:34
  • @AdHominem, `AttributeError: __exit__` --> You are using Python 2.7 or lower. `__enter__()` and `__exit__()` are invoked on entry to and exit from the body of the with statement. Those methods were added to the Pool class in Python 3. For a Pool, `__exit__()` automatically terminates the processes when all tasks complete. (A Python 2-compatible pattern is sketched after these comments.) – 7stud Nov 26 '15 at 19:33
  • *Why can I set an arbitrary number of worker processes* -- So that you can leave some cores open. I'd guess that if you specify more than the number of cores your computer has, then it uses all of them. If you specify the number as `None`, then Python will check how many cores the computer has and use them all. – 7stud Nov 26 '15 at 19:38
  • *Does choosing multiprocessing over multithreading have something to do with the GIL limiting the usefulness of threads in CPython?* See here: http://jessenoller.com/blog/2009/02/01/python-threads-and-the-global-interpreter-lock – 7stud Nov 26 '15 at 19:57
  • Also, here: http://stackoverflow.com/questions/2986931/how-do-smp-cores-processes-and-threads-work-togethre-exactly. So, what I said in my first comment isn't correct. – 7stud Nov 26 '15 at 20:06
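For readers on Python 2 who hit the `AttributeError: __exit__` discussed above, a minimal sketch of the same pool without the with statement (an editorial addition, not part of the original answer; close()/join() shut the pool down gracefully, which is what the with statement handles for you in Python 3):

import requests
from multiprocessing import Pool

def process_html(url):
    text = requests.get(url).text
    print(text[:500])

if __name__ == '__main__':
    urls = ['http://www.apple.com', 'http://www.yahoo.com', 'http://www.google.com']
    p = Pool(None)  # None => one worker per CPU core
    try:
        p.map(process_html, urls)  # blocks until all tasks finish
    finally:
        p.close()  # stop accepting new work
        p.join()   # wait for the workers to exit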
0

You need to use threading and issue each requests.get(...) call in a different thread, i.e. fetch the URLs in parallel.

See these two answers on SO for example and usage:
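A minimal sketch of the threaded approach using concurrent.futures.ThreadPoolExecutor from the standard library (Python 3.2+). This is an editorial illustration, not code from the linked answers; the worker count of 10 is an arbitrary choice and "url1"/"url2" are placeholders:

import requests
from concurrent.futures import ThreadPoolExecutor

url_list = ["url1", "url2"]

def fetch_url(url):
    return requests.get(url).text

with ThreadPoolExecutor(max_workers=10) as executor:
    # map() preserves input order, so each text lines up with its URL
    for url, text in zip(url_list, executor.map(fetch_url, url_list)):
        # Process text here (put in database, search, etc.)
        print(url, len(text))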

aneroid
-1
import threading
import requests

url_list = ["url1", "url2"]

def fetch_url(url):
    url_response = requests.get(url)
    text = url_response.text
    # Process text here

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()