15

As the title suggests, I'm working on a site written in Python, and it makes several calls to the urllib2 module to read websites. I then parse them with BeautifulSoup.

As I have to read 5-10 sites, the page takes a while to load.

I'm just wondering if there's a way to read the sites all at once? Or any tricks to make it faster, like should I close the urllib2.urlopen after each read, or keep it open?

Added: also, if I were to just switch over to PHP, would that be faster for fetching and parsing HTML and XML files from other sites? I just want it to load faster, as opposed to the ~20 seconds it currently takes.

Bibhas Debnath
Jack z
    What makes you think `open` is slow? BeautifulSoup (as useful as it is) does far more work and I'd presume it is the bottleneck in the code. Did you try it without parsing? A code sample here would help. – msw Aug 12 '10 at 22:30
  • No just going PHP will not help. Python has tons of room to be fast, you just need to optimize your code. – bwawok Aug 12 '10 at 22:41
  • Oh shoot, so the threading thing isn't good? Thanks for the heads up – Jack z Aug 12 '10 at 23:06
  • See my [answer](http://stackoverflow.com/questions/9007456/parallel-fetching-of-files/9010299#9010299) to [Parallel fetching of files](http://stackoverflow.com/q/9007456/95735) question. – Piotr Dobrogost Jun 22 '12 at 12:16

9 Answers

18

I'm rewriting Dumb Guy's code below using modern Python modules like threading and Queue.

import threading, urllib2
import Queue

urls_to_load = [
'http://stackoverflow.com/',
'http://slashdot.org/',
'http://www.archive.org/',
'http://www.yahoo.co.jp/',
]

def read_url(url, queue):
    data = urllib2.urlopen(url).read()
    print('Fetched %s bytes from %s' % (len(data), url))
    queue.put(data)  # hand the page body back to the caller through the queue

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every URL has been fetched
    return result

def fetch_sequential():
    result = Queue.Queue()
    for url in urls_to_load:
        read_url(url, result)
    return result

Best time for fetch_sequential() is 2s. Best time for fetch_parallel() is 0.9s.

Also, it is incorrect to say threads are useless in Python because of the GIL. This is one of those cases where threads are useful in Python, because they are blocked on I/O. As you can see from my results, the parallel case is about 2 times faster.
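For reference, a minimal sketch of how the returned Queue might be drained afterwards (this loop is illustrative, not part of the original answer):

pages = fetch_parallel()
while not pages.empty():
    html = pages.get()
    print(len(html))  # e.g. hand the page off to BeautifulSoup here

Because fetch_parallel() joins all threads before returning, checking empty() here is safe.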

MERose
Wai Yip Tung
    Yes, *if* this was the only way to fetch URLs, this would be closer to the correct way to use threads. However, async IO is *still* going to be faster, more maintainable, allow for deterministic debugging, and so on. Even without the GIL, it would be a superior solution. – habnabit Aug 12 '10 at 23:52
  • Oops, it looks like Dumb Guy has retracted his answer. Hey, I'd say you were going on the right track! – Wai Yip Tung Aug 12 '10 at 23:53
    Aaron, can you provide a working example to show that async IO code is more maintainable? – Wai Yip Tung Aug 12 '10 at 23:56
  • @Wai Yip Tung, less code is going to be more maintainable than more code, especially if it's immediately obvious what that code does. Threads require more code to do less in order to work around the problems with shared-state concurrency (i.e. you need locks). You could use worker processes instead of worker threads in order to eliminate the shared-state part, but still, you could just use twisted and be done with it. – habnabit Aug 13 '10 at 00:07
  • Oh, I didn't realize twisted could fetch pages, thanks for mentioning that! – Jack z Aug 13 '10 at 00:23
  • I've restored my post for reference here. – Dumb Guy Aug 13 '10 at 00:39
  • @Aaron Gallagher I've tried to learn how to use Twisted once, but ended up just using simple old sockets. If you would answer Wai's request for a working example, maybe i could change my opinion. That said, frameworks tend to contain more code than a custom solution requires. – Cees Timmerman Mar 22 '12 at 12:20
  • `urllib2` is not thread safe. In your `fetch_parallel` function, multiple threads call urllib2.urlopen. Is that going to cause any problems? – foresightyj Mar 02 '13 at 10:20
  • @foresightyj `urllib2.urlopen` is being instantiated newly each time it's used, so threads shouldn't be an issue. It would only be if a single urllib2 reference was being used across multiple threads that there'd be a problem. – fantabolous Oct 04 '16 at 00:45
  • @WaiYipTung aren't there better ways to go like aiohttp or asyncio? – Armen Sanoyan Jan 06 '21 at 16:27
9

Edit: Please take a look at Wai's post for a better version of this code. Note that there is nothing wrong with this code and it will work properly, despite the comments below.

The speed of reading web pages is probably bounded by your Internet connection, not Python.

You could use threads to load them all at once.

import thread, time, urllib

websites = {}

def read_url(url):
    websites[url] = urllib.urlopen(url).read()

# urls_to_load is your list of URLs to fetch
for url in urls_to_load:
    thread.start_new_thread(read_url, (url,))

# wait until every page has been stored in the dict
while len(websites) < len(urls_to_load):
    time.sleep(0.1)

# Now websites will contain the contents of all the web pages in urls_to_load
tzot
Dumb Guy
  • The bottleneck is probably not even the internet connection but the remote server. However, BeautifulSoup is slow in any case. So it will add an extra delay. – Wolph Aug 12 '10 at 22:36
  • Oh okay, that makes sense. And I appreciate the example code thanks! – Jack z Aug 12 '10 at 22:45
    -1 for threads *and* suggesting the `thread` module _and_ not doing any locking or *even* using the `Queue` module. You're just going to add way more complexity and locking overhead for no gain if you use threads. Even if this wasn't true, your code demonstrates that you don't really know how to use threads. – habnabit Aug 12 '10 at 22:53
  • The global interpreter lock should keep the dictionary assignment from happening simultaneously in two different threads. I should have mentioned it, though. – Dumb Guy Aug 12 '10 at 23:04
    @Dumb Guy, no, it doesn't. The GIL isn't a replacement for proper locking, and also isn't present in all python implementations. Either way, mutating global state is a horrible, *horrible* way to communicate between threads. This is what the `Queue` module is for. – habnabit Aug 12 '10 at 23:08
3

It is maybe not perfect, but when I need the data from a site, I just do this:

import socket

def geturldata(url):
    # pass the URL without the "http://" prefix, e.g. "stackoverflow.com/questions"
    # (this talks plain HTTP on port 80 only)
    server = url.split("/")[0]
    path = url[len(server):] or "/"
    returndata = str()
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((server, 80))  # let's connect :p

    s.send("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, server))  # simple HTTP request
    while 1:
        data = s.recv(1024)  # read the response in 1 KB chunks
        if not data:
            break
        returndata = returndata + data
    s.close()
    return returndata.split("\r\n\r\n", 1)[1]  # drop the headers, keep the body
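A quick usage sketch for the helper above (the hostname is only an example):

body = geturldata("www.example.com/index.html")
print(len(body))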
Thomas15v
2

Not sure why nobody mentions multiprocessing (if anyone knows why this might be a bad idea, let me know):

import multiprocessing
from urllib2 import urlopen

URLS = [....]

def get_content(url):
    return urlopen(url).read()

# on Windows, wrap the pool usage below in an ``if __name__ == '__main__':`` guard
pool = multiprocessing.Pool(processes=8)  # play with ``processes`` for best results
results = pool.map(get_content, URLS)     # this line blocks; look at map_async
                                          # for a non-blocking map() call
pool.close()  # the process pool no longer accepts new tasks
pool.join()   # join the processes: this blocks until all URLs are processed
for result in results:
    pass      # do something with each page here

There are a few caveats with multiprocessing pools. First, unlike threads, these are completely new Python processes, each with its own interpreter. While they are not subject to the global interpreter lock, this also means you are limited in what you can pass across to the new process.

You cannot pass lambdas and functions that are defined dynamically. The function that is used in the map() call must be defined in your module in a way that allows the other process to import it.

Pool.map(), which is the most straightforward way to process multiple tasks concurrently, doesn't provide a way to pass multiple arguments, so you may need to write wrapper functions or change function signatures, and/or pass multiple arguments as part of the iterable that is being mapped (see the sketch at the end of this answer).

You cannot have child processes spawn new ones. Only the parent can spawn child processes. This means you have to carefully plan and benchmark (and sometimes write multiple versions of your code) in order to determine what the most effective use of processes would be.

Drawbacks notwithstanding, I find multiprocessing to be one of the most straightforward ways to do concurrent blocking calls. You can also combine multiprocessing and threads (afaik, but please correct me if I'm wrong), or combine multiprocessing with green threads.
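To illustrate the multiple-argument workaround mentioned above, here is a minimal sketch of the tuple-wrapper approach (fetch and fetch_wrapper are hypothetical names, not library functions):

from multiprocessing import Pool
from urllib2 import urlopen

def fetch(url, timeout):
    return urlopen(url, timeout=timeout).read()

def fetch_wrapper(args):
    # unpack the (url, timeout) tuple that came through the mapped iterable
    return fetch(*args)

if __name__ == '__main__':
    pool = Pool(processes=4)
    tasks = [('http://stackoverflow.com/', 10), ('http://www.archive.org/', 10)]
    pages = pool.map(fetch_wrapper, tasks)
    pool.close()
    pool.join()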

2

As a general rule, a given construct in any language is not slow until it is measured.

In Python, not only do timings often run counter to intuition, but the tools for measuring execution time are exceptionally good.
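For example, a minimal timing sketch with the standard timeit module (the URL and repetition count are arbitrary):

import timeit

# time three sequential fetches of a single page
print(timeit.timeit(
    "urllib2.urlopen('http://stackoverflow.com/').read()",
    setup="import urllib2",
    number=3))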

msw
2

Scrapy might be useful for you. If you don't need all of its functionality, you might just use twisted's twisted.web.client.getPage instead. Asynchronous IO in one thread is going to be way more performant and easy to debug than anything that uses multiple threads and blocking IO.
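For reference, a minimal sketch of fetching several pages with twisted.web.client.getPage (based on the Twisted API of that era; getPage has since been deprecated):

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import getPage

urls = ['http://stackoverflow.com/', 'http://slashdot.org/']

def on_page(data, url):
    print('Fetched %s bytes from %s' % (len(data), url))

deferreds = [getPage(url).addCallback(on_page, url) for url in urls]
DeferredList(deferreds).addBoth(lambda _: reactor.stop())
reactor.run()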

habnabit
  • Okay, I've heard about that being faster. Thanks! – Jack z Aug 12 '10 at 23:29
  • @msw, is my answer cut off in your browser? The full sentence is "Asynchronous IO in one thread is going to be way more performant and easy to debug than anything that uses multiple threads and blocking IO." – habnabit Aug 13 '10 at 00:08
  • I should have been more clear; sorry. The OP hasn't even made a case for needing asynchronous IO, and your philosophy of "get it right, first" noted above is a good stance. But I fear it isn't impressing the OP, oh well ;) – msw Aug 13 '10 at 01:30
1

1) Are you opening the same site many times, or many different sites? If many different sites, I think urllib2 is good. If doing the same site over and over again, I have had some personal luck with urllib3 http://code.google.com/p/urllib3/

2) BeautifulSoup is easy to use, but is pretty slow. If you do have to use it, make sure to decompose your tags to get rid of memory leaks, or it will likely lead to memory issues (it did for me); see the sketch below.

What do your memory and CPU usage look like? If you are maxing out your CPU, make sure you are using real heavyweight threads, so you can run on more than one core.
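Regarding point 2), a minimal sketch of the decompose() idea, assuming the bs4 package (the variable names are illustrative):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)                             # html is the page you fetched
prices = [tag.text for tag in soup.find_all('span')]   # pull out what you need first
soup.decompose()                                       # break the tree apart so it can be garbage-collected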

bwawok
  • I'm accessing XML pages for Amazon, eBay, and Half. So while similar, the products and prices change – Jack z Aug 12 '10 at 22:31
  • Okay so then urllib2 is fine. You need to thread out your program to use heavyweight threads, and parse as efficiently as possible. – bwawok Aug 12 '10 at 22:40
0

How about using pycurl?

You can apt-get it by

$ sudo apt-get install python-pycurl
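A minimal single-request sketch with pycurl might look like this (Python 2; the URL is arbitrary). For parallel transfers, pycurl also exposes curl's multi interface via pycurl.CurlMulti:

import pycurl
from StringIO import StringIO

buf = StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://stackoverflow.com/')
c.setopt(pycurl.WRITEFUNCTION, buf.write)  # collect the response body in memory
c.perform()
c.close()
html = buf.getvalue()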
OTZ
0

First, you should try multithreading/multiprocessing packages. Currently, the three popular ones are multiprocessing, concurrent.futures, and threading. Those packages could help you open multiple URLs at the same time, which could increase the speed.

More importantly, after adding multithreading, if you try to open hundreds of URLs at the same time, you will find that urllib.request.urlopen is very slow, and opening and reading the content becomes the most time-consuming part. So if you want to make it even faster, you should try the requests package: requests.get(url).content is faster than urllib.request.urlopen(url).read().

So, here I list two examples of fast multi-URL parsing, both faster than the other answers. The first example uses the classical threading package and generates hundreds of threads at the same time. (One trivial shortcoming is that it cannot keep the original order of the tickers.)

import time
import threading
import pandas as pd
import requests
from bs4 import BeautifulSoup


ticker = pd.ExcelFile('short_tickerlist.xlsx')
ticker_df = ticker.parse(str(ticker.sheet_names[0]))
ticker_list = list(ticker_df['Ticker'])

start = time.time()

result = []
def fetch(ticker):
    url = ('http://finance.yahoo.com/quote/' + ticker)
    print('Visit ' + url)
    text = requests.get(url).content
    soup = BeautifulSoup(text,'lxml')
    result.append([ticker,soup])
    print(url +' fetching...... ' + str(time.time()-start))



if __name__ == '__main__':
    process = [None] * len(ticker_list)
    for i in range(len(ticker_list)):
        process[i] = threading.Thread(target=fetch, args=[ticker_list[i]])

    for i in range(len(ticker_list)):
        print('Start_' + str(i))
        process[i].start()

    # wait for every thread, otherwise the elapsed time below is printed
    # before the fetches have actually finished
    for i in range(len(ticker_list)):
        print('Join_' + str(i))
        process[i].join()

    print("Elapsed Time: %ss" % (time.time() - start))

The second example uses the multiprocessing package, and it is a little more straightforward, since you just need to state the pool size and map the function. The order does not change after fetching the content, and the speed is similar to the first example but much faster than the other methods.

from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import time

os.chdir('file_path')

start = time.time()

def fetch_url(x):
    print('Getting Data')
    myurl = ("http://finance.yahoo.com/q/cp?s=%s" % x)
    html = requests.get(myurl).content
    soup = BeautifulSoup(html,'lxml')
    out = str(soup)
    listOut = [x, out]
    return listOut

tickDF = pd.read_excel('short_tickerlist.xlsx')
li = tickDF['Ticker'].tolist()    

if __name__ == '__main__':
    p = Pool(5)
    output = p.map(fetch_url, li, chunksize=30)
    print("Time is %ss" %(time.time()-start))