
In the code below, I am considering using multi-threading or multi-processing for fetching data from the URL. I think pools would be ideal. Can anyone suggest a solution?

Idea: pool threads/processes and collect the data. My preference is processes over threads, but I'm not sure.

import urllib

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP',)  # note the trailing comma: ('GGP') is just a string

def fetch_quote(symbols):
    # One HTTP request fetches the quotes for all symbols at once.
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    data_fp = fetch_quote(symbols)
#    print data_fp
if __name__ =='__main__':
    main()
Merlin

4 Answers


So here's a very simple example. It iterates over symbols passing one at a time to fetch_quote.

import urllib
import multiprocessing

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')
#symbols = ('GGP',)  # note the trailing comma: ('GGP') is just a string

def fetch_quote(symbol):
    # Each call now fetches a single symbol, so no join is needed here;
    # '+'.join('GGP') would produce 'G+G+P'.
    url = URL % symbol
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data


def main():

    PROCESSES = 4
    print 'Creating pool with %d processes\n' % PROCESSES
    pool = multiprocessing.Pool(PROCESSES)
    print 'pool = %s' % pool
    print

    # The args argument must be a sequence: passing the bare string would
    # unpack it into one argument per character (the TypeError reported
    # in the comments below).
    results = [pool.apply_async(fetch_quote, (sym,)) for sym in symbols]

    print 'Ordered results using pool.apply_async():'
    for r in results:
        print '\t', r.get()

    pool.close()
    pool.join()

if __name__ =='__main__':
    main()
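
Since the results are read back in submission order anyway, `pool.map` expresses the same thing in one call. A minimal sketch of `main()` rewritten that way, reusing the imports, `symbols`, and `fetch_quote` defined above:

def main():
    pool = multiprocessing.Pool(4)
    # map() submits every symbol and returns the results in input order,
    # replacing the apply_async/get loop above.
    for data in pool.map(fetch_quote, symbols):
        print '\t', data
    pool.close()
    pool.join()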
mluebke
  • There might be some issues if retrieved pages are quite large. `multiprocessing` uses inter-process communication mechanisms for exchanging information among processes (one mitigation is sketched after these comments). – Andrey Vlasovskikh Sep 08 '10 at 16:56
  • True, the above was for simple illustrative purposes only. YMMV, but I wanted to show how simple it was to take his code and make it multiprocess. – mluebke Sep 08 '10 at 17:00
  • I got this error: Creating pool with 4 processes pool = Ordered results using pool.apply_async(): Traceback (most recent call last): File "C:\py\Raw\Yh_Mp.py", line 36, in main() File "C:\py\Raw\Yh_Mp.py", line 30, in main print '\t', r.get() File "C:\Python26\lib\multiprocessing\pool.py", line 422, in get raise self._value TypeError: fetch_quote() takes exactly 1 argument (3 given) – Merlin Sep 08 '10 at 17:21
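
One way to blunt the IPC cost mentioned above is to parse the CSV inside the worker and send back only the fields you need, so just a small tuple crosses the process boundary. A minimal sketch; `fetch_price` is a hypothetical helper, and the column layout (symbol, last price, time, volume for `f=sl1t1v`) is an assumption about the Yahoo CSV format:

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN')

def fetch_price(symbol):
    fp = urllib.urlopen(URL % symbol)
    try:
        line = fp.read().strip()
    finally:
        fp.close()
    # Parse in the worker; only (symbol, price) is pickled back to the
    # parent instead of the whole response body.
    fields = line.split(',')
    return symbol, fields[1]  # assumed: field 1 is the last-trade price

if __name__ == '__main__':
    pool = Pool(4)
    for sym, price in pool.map(fetch_price, symbols):
        print sym, price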

Your code requests several quotes in a single call. Let's fetch them one by one instead. Your code becomes:

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

def main():
    for symbol in symbols:
        data_fp = fetch_quote((symbol,))
        print data_fp

if __name__ == "__main__":
    main()

So main() fetches each URL one by one to get the data. Let's multiprocess it with a pool:

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN', 'GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbols):
    url = URL % '+'.join(symbols)
    fp = urllib.urlopen(url)
    try:
        data = fp.read()
    finally:
        fp.close()
    return data

if __name__ == '__main__':
    pool = Pool(processes=5)
    # Submit every request first, then collect the results; calling get()
    # right after each apply_async would wait for that result and
    # effectively serialize the work again.
    results = [pool.apply_async(fetch_quote, [(symbol,)]) for symbol in symbols]
    for result in results:
        print result.get(timeout=10)

In the main block above, a pool worker is used to request each symbol's URL.

Note: in Python, because of the GIL, multithreading must mostly be considered the wrong solution.

For documentation, see: multiprocessing in Python

ohe
  • GIL is not an issue here because this task is definitely IO-bound. – Andrey Vlasovskikh Sep 08 '10 at 16:57
  • this method is much slower than no multiprocessing. If I use a list of 150 stocks (copy the list above so the stocks number 150), I get errors and it is very slow. Would threading be better? – Merlin Sep 08 '10 at 18:18
  • @user428862 one reason it gets slow when your list of symbols grows is that pool.apply_async serializes your list and passes it to your child process via pipes. As the size of the arguments passed to child processes increases, you incur overhead. On Windows there's not much we can do, but try this approach on UNIX: there, fork(2) is used to spawn processes, which essentially passes the entire parent process state to the child. So if there is some global var in the parent process, a child process will be able to access it. As `symbols` is already global, don't pass it in args; hence no serialization (see the sketch after these comments)... – Srikar Appalaraju Sep 08 '10 at 19:47
  • @movie, thanks; single thread is 2 sec, multiprocess is 18 sec. For comparison, how would I multithread this, or will the same problem arise? – Merlin Sep 08 '10 at 19:55
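
A minimal sketch of the fork-inheritance idea from the comments above: on UNIX the forked workers inherit the module-level `symbols`, so only a small integer index is pickled per task. `fetch_by_index` is a hypothetical helper, not part of the original code:

import urllib
from multiprocessing import Pool

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN')  # global: inherited by forked workers

def fetch_by_index(i):
    # Look the symbol up in the inherited global instead of receiving it
    # through the pickled argument list; only the integer travels the pipe.
    fp = urllib.urlopen(URL % symbols[i])
    try:
        return fp.read()
    finally:
        fp.close()

if __name__ == '__main__':
    pool = Pool(processes=4)
    for data in pool.map(fetch_by_index, range(len(symbols))):
        print data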

Actually it's possible to do it with neither. You can get it done in one thread, using asynchronous calls such as twisted.web.client.getPage from Twisted Web.
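
A minimal sketch of that approach, assuming an older Twisted where `twisted.web.client.getPage` is still available (it was later deprecated in favour of `twisted.web.client.Agent`):

from twisted.internet import reactor, defer
from twisted.web.client import getPage

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN')

def print_quote(data, symbol):
    print symbol, data

def main():
    # Fire off every request at once; each getPage() returns a Deferred
    # that fires with the response body when it arrives.
    ds = []
    for symbol in symbols:
        d = getPage(URL % symbol)
        d.addCallback(print_quote, symbol)
        ds.append(d)
    # Stop the reactor once all requests have completed or failed.
    defer.DeferredList(ds, consumeErrors=True).addCallback(
        lambda _: reactor.stop())

if __name__ == '__main__':
    main()
    reactor.run()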

vartec
  • @vartec no need to go for any 3rd-party extra packages. Python 2.6+ has pretty good built-in packages for this kind of purpose. – Srikar Appalaraju Sep 08 '10 at 16:46
  • 2
    Uh oh, someone mentioned Twisted, that means that all other answers are going to get downvoted. http://stackoverflow.com/questions/3490173/how-can-i-speed-up-fetching-pages-with-urllib2-in-python/3490191#3490191 – Nick T Sep 08 '10 at 18:02
  • @movieyoda: well, for obvious reasons (GAE, Jython) I like to stay compatible with 2.5. Anyway, maybe I'm missing out on something; what support for asynchronous web calls was introduced in Python 2.6? – vartec Sep 09 '10 at 07:50
  • @Nick: unfortunately, because of the GIL, Python sucks at threading (I know, function calls are done with the GIL released), so you gain nothing from using threads instead of deferred async calls. On the other hand, event-driven programming rules even in cases when you actually could use threads (see nginx, lighttpd), and obviously in the case of Python (Twisted, Tornado). – vartec Sep 09 '10 at 07:54
  • @vartec if I am not wrong, the `multiprocessing` module was made available natively in Python from 2.6 onwards. I think it was called `pyprocessing` before that, a separate 3rd-party module. – Srikar Appalaraju Sep 09 '10 at 13:50
  • @movieyoda: true, but I wouldn't call `multiprocessing` a package for the same purpose as async calls. – vartec Sep 09 '10 at 15:57

As you may know, multi-threading in Python is not actually multi-threading, due to the GIL; essentially only a single thread runs at any given time. So if you want multiple URLs to be fetched at any given time in your program, multi-threading might not be the way to go. Also, after the crawl, do you store the data in a single file or in some persistent DB? That decision could affect your performance.

Multiple processes are more efficient that way, but they carry the time and memory overhead of spawning extra processes. I have explored both of these options in Python recently. Here's the URL (with code):

python -> multiprocessing module
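
For comparison, the comments below point out that threads work fine for IO-bound fetching. A minimal sketch using `multiprocessing.dummy`, which backs the same `Pool` API with threads (available since Python 2.6):

import urllib
from multiprocessing.dummy import Pool as ThreadPool  # threads, same Pool API

URL = "http://download.finance.yahoo.com/d/quotes.csv?s=%s&f=sl1t1v&e=.csv"
symbols = ('GGP', 'JPM', 'AIG', 'AMZN')

def fetch_quote(symbol):
    # The GIL is released while the socket waits, so threads overlap the
    # network time even though only one runs Python bytecode at a time.
    fp = urllib.urlopen(URL % symbol)
    try:
        return fp.read()
    finally:
        fp.close()

if __name__ == '__main__':
    pool = ThreadPool(4)
    for data in pool.map(fetch_quote, symbols):
        print data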

Srikar Appalaraju
  • 2
    IO code is run without acquiring the GIL. For IO-bound maps `threading` works well. – Andrey Vlasovskikh Sep 08 '10 at 16:49
  • All I wanted to say was that, while considering multi-threading in Python, one needs to keep the GIL in mind. After getting the URL data, one may want to parse it (create a DOM → CPU-intensive) or directly dump it into a file (an IO operation). In the latter case the effect of the GIL is downplayed, but in the former the GIL plays a prominent part in the efficiency of the program. Don't know why people take it so personally that they have to downvote the post... – Srikar Appalaraju Sep 08 '10 at 19:34
  • @user428862 threading and multiprocessing in Python essentially have the same interfaces/API calls. You could just take my example and import threading instead of multiprocessing (as in the thread-pool sketch above). Give it a try, and if you run into some problems I'll help you... – Srikar Appalaraju Sep 08 '10 at 19:37
  • I am new to Python. I looked at your code; there are no imports. Thanks for the offer of help. – Merlin Sep 08 '10 at 20:01
  • Oh, in that case it should be `import multiprocessing`. – Srikar Appalaraju Sep 09 '10 at 04:28