
I have a file with 100,000 URLs that I need to request and then process. The processing takes a non-negligible amount of time compared to the request, so simply using multithreading only gives me a partial speed-up. From what I have read, I think using the multiprocessing module, or something similar, would offer a more substantial speed-up because I could use multiple cores. I'm guessing I want to use multiple processes, each with multiple threads, but I'm not sure how to do that.

Here is my current code, using threading (based on What is the fastest way to send 100,000 HTTP requests in Python?):

from threading import Thread
from Queue import Queue
import requests
from bs4 import BeautifulSoup
import sys

concurrent = 100

def worker():
    # Each worker thread repeatedly pulls a URL off the queue, fetches it, and processes it.
    while True:
        url = q.get()
        html = get_html(url)
        process_html(html)
        q.task_done()

def get_html(url):
    try:
        html = requests.get(url, timeout=5, headers={'Connection':'close'}).text
        return html
    except requests.RequestException:
        print "error", url
        return None

def process_html(html):
    if html is None:
        return
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    # do some more processing
    # write the text to a file

q = Queue(concurrent * 2)  # bounded so the feeder doesn't read far ahead of the workers
for i in range(concurrent):
    t = Thread(target=worker)
    t.daemon = True
    t.start()
try:
    for url in open('text.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
  • @Gus you are not getting any speed-up from a concurrency of 100. The requests all go out at the same time, and, surprise, they all come back and wait on OS processes. What you could do is split it into two steps: pull everything down locally with threading (I/O), then multiprocess with cores*2 workers (otherwise you will have the same problem). – Merlin Jun 15 '16 at 02:54
  • I see. So maybe I could break it into two scripts - one that only uses multithreading and saves the raw HTML to files, and another that uses multiprocessing to process the files and delete them when done (roughly the split sketched below). Not sure if this is the best solution though? – Gus Jun 15 '16 at 03:00
  • That's what I would do -- and do. Otherwise you could switch languages and use nodejs to scrape, but that's a different process entirely. – Merlin Jun 15 '16 at 03:03
  • The link is misleading -- it does not say anything about the HTML processing. Use lxml.html or requests (and, if you know jQuery, pyquery). – Merlin Jun 15 '16 at 03:09
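
For reference, a minimal sketch of the two-stage split discussed in these comments: fetch with a thread pool and stage the raw HTML on disk, then parse with a process pool. The directory name, pool sizes, and helper names (HTML_DIR, fetch, process) are illustrative assumptions, not code from the question:

import os
import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool   # thread-based pool for I/O
from multiprocessing import Pool as ProcessPool        # process-based pool for CPU work

import requests
from bs4 import BeautifulSoup

HTML_DIR = 'html'                       # where stage 1 stages the raw pages
CONCURRENT_REQUESTS = 100               # threads for stage 1
WORKERS = multiprocessing.cpu_count()   # processes for stage 2


def fetch(args):
    # Stage 1 (I/O bound): download one URL and save the raw bytes to disk.
    index, url = args
    try:
        body = requests.get(url, timeout=5).content
    except requests.RequestException:
        return
    with open(os.path.join(HTML_DIR, '%d.html' % index), 'wb') as f:
        f.write(body)


def process(path):
    # Stage 2 (CPU bound): parse one saved page, extract its text, then delete it.
    with open(path, 'rb') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    text = soup.get_text()
    # ... do some more processing, write the text to a file ...
    os.remove(path)


if __name__ == '__main__':
    if not os.path.isdir(HTML_DIR):
        os.mkdir(HTML_DIR)

    with open('text.txt') as f:
        urls = [line.strip() for line in f if line.strip()]

    # Threads are enough for fetching because the work is network-bound.
    thread_pool = ThreadPool(CONCURRENT_REQUESTS)
    thread_pool.map(fetch, list(enumerate(urls)))
    thread_pool.close()
    thread_pool.join()

    # Separate processes sidestep the GIL for the parsing work.
    paths = [os.path.join(HTML_DIR, name) for name in os.listdir(HTML_DIR)]
    process_pool = ProcessPool(WORKERS)
    process_pool.map(process, paths)
    process_pool.close()
    process_pool.join()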

1 Answer


If the file isn't bigger than your available memory, then instead of just reading it with the built-in open(), map it with mmap (https://docs.python.org/3/library/mmap.html). Reading from the mapping is then about as fast as working with memory rather than a file.

import mmap

with open("text.txt", "rb") as f:
    # Map the whole file read-only; length 0 means "map everything".
    mmap_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # code that does what you need
    mmap_file.close()
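
For the question's workload, the placeholder could feed the queue line by line; a minimal sketch, assuming the same queue q and worker threads as in the question (readline() returns bytes, so decode before stripping):

import mmap

with open("text.txt", "rb") as f:
    mmap_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # readline() yields b"" at end of file, which stops the iterator.
    for line in iter(mmap_file.readline, b""):
        q.put(line.decode("utf-8").strip())
    mmap_file.close()
    q.join()  # wait for the worker threads to drain the queue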